booking unicode

9.13: Validate input -- character encoding:

. this is an expansion of
Wheeler's how-to on secure programming

. staying in control of character encoding
can determine your effectiveness in filtering web content
for preventing cross-site scripting and semantic attacks .
when needing to process older documents
in various older (language-specific) character sets,
you need to ensure that an untrusted user cannot control
the setting of another document's character set .
. this page will explain how the various char'encodings can be tricky,
and how mistranslation can affect security .

. UCS (Universal Character Set) is the new standard (subsuming ASCII);
it's the function { integer -> image } for all the images of the world,
requiring more than 2 bytes but fewer than 4 .

. UTF (UCS Transformation Format) is the generic name for
how to code the UCS -- there are several approaches:
the efficient and backward-compatible way is to use a variant record;
eg, UTF-8:
variant#1 is ascii in a single byte;
variant#{2..4} can be contained in {2..4} bytes, respectively .
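. the variant lengths, sketched in python (the sample characters are my own picks):

```python
# each code point lands in 1..4 bytes depending on its magnitude:
for ch, nbytes in [("A", 1), ("\u00e9", 2), ("\u20ac", 3), ("\U0001d11e", 4)]:
    assert len(ch.encode("utf-8")) == nbytes
```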

. alt'ly, UTF-32 is simply a 4-byte integer
that maps directly to UCS values -- no decoding is needed .

. for handling international characters UTF-8 has become the standard;
. an attacker can exploit an illegal UTF-8 value
by sneaking metacharacters into the string
which can turn parameter names into arbitrary commands .

9.14: decoding invalid UTF-8:
. for instance, the wrong way to decode the invalid UTF-8 value
0,C0,80 #16 = (110)0,0000,(10)00,0000 #2
is to remove all the discriminant bits (shown in parentheses),
and return the remaining bits (00#16);
because the discriminant (110) was telling you
the remaining bits should have been within 0080 ... 07FF #16;
if the result wasn't in that range, the value is illegal,
and the translator should warn the client .
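. sketched in python: a strict decoder refuses the overlong C0,80#16 outright rather than quietly returning 00#16:

```python
# a strict UTF-8 decoder must reject the overlong sequence C0,80
# instead of silently decoding it to U+0000:
try:
    b"\xc0\x80".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```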
. another example relevant to security
would be filtering to prohibit the char.string that
gives a unix filename reference to the parent directory "/../":
/ . . /
2F,2E,2E,2F #16 . weak parsing could still be fooled by this:
2F,C0,AE,2E,2F #16 . the same string with the first 2E replaced by C0,AE .
. then the client might misparse this invalid UTF-8 as 2E#16
when in fact it is an illegal value .
. C0,AE = (110)0,0000,(10)10,1110 -utf-8-> 10,1110#2 = 2E#16
(illegal value because for discriminant (110)
the value should be at least 80#16, and 2E is not ) .
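. the bypass above, sketched in python (the filter function is a hypothetical stand-in for a weak byte-level check):

```python
# hypothetical naive filter: scans the raw bytes for the literal "/../"
def naive_filter(raw: bytes) -> bool:
    return b"/../" not in raw            # True means "looks safe"

evasion = b"/\xc0\xae./"                 # 2F,C0,AE,2E,2F #16
assert naive_filter(evasion)             # the byte-level filter misses it

# safer order: decode strictly first, then filter;
# a strict UTF-8 decoder rejects the overlong C0,AE before any filtering:
try:
    evasion.decode("utf-8")
    safe = naive_filter(evasion)
except UnicodeDecodeError:
    safe = False                         # treat undecodable input as hostile
assert not safe
```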
. another complication is that
there are some obsolete formats to watch for:
not only does the UTF-8 conversion have to check that it's
holding the correct range of UCS values,
it also has to check for the format being UTF-16:
convert UCS-2 to UTF-8:
. UCS-2 is just the first 2 bytes of 4-byte UCS;
so the UCS conversion can be used by
extending each UCS-2 character with two zero-valued bytes: 00,00,xx,xx#16 ;
pairs of UCS-2 values between D800 and DFFF need special treatment:
because, they are "(surrogate pairs)
being actually UCS-4 characters transformed through UTF-16:
. convert the UTF-16 to UTF-32,
then convert the UTF-32 to UTF-8 .
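. that two-step conversion, sketched in python (function name is mine): combine a UTF-16 surrogate pair into a UTF-32 code point, then re-encode as UTF-8:

```python
# combine a surrogate pair (UTF-16) into a code point (UTF-32),
# then let the UTF-8 encoder finish the job:
def utf16_pair_to_utf8(w1: int, w2: int) -> bytes:
    assert 0xD800 <= w1 <= 0xDBFF and 0xDC00 <= w2 <= 0xDFFF
    cp = 0x10000 + (((w1 - 0xD800) << 10) | (w2 - 0xDC00))
    return chr(cp).encode("utf-8")

# U+1D11E is the surrogate pair D8,34 DD,1E #16 in UTF-16:
assert utf16_pair_to_utf8(0xD834, 0xDD1E) == "\U0001d11e".encode("utf-8")
```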

control characters:
00 ... 1F #16 -- C0 set
80 ... 9F #16 -- C1 set
for compatibility with the ISO/IEC 2022 framework
-- 7F (delete) is also a sort of control char .

9.15: BOM and unicode signature U+FEFF:

. a BOM (byte order mark) is the value U+FEFF
prepended to the unicode char.string,
thereby acting as the signature of a Unicode.string
and indicating the byte ordering:
. here is the format of U+FEFF per UTF form and byte ordering:
00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
-- the UTF-8 format is byte-oriented,
so it is unaffected by byte ordering .
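. BOM sniffing per the table above, sketched in python (function name is mine):

```python
# the 4-byte UTF-32 marks must be tried before the 2-byte UTF-16 ones,
# since FF,FE,00,00 also begins with FF,FE:
def sniff_bom(data: bytes):
    marks = [(b"\x00\x00\xfe\xff", "utf-32-be"),
             (b"\xff\xfe\x00\x00", "utf-32-le"),
             (b"\xef\xbb\xbf", "utf-8"),       # U+FEFF itself, in UTF-8
             (b"\xfe\xff", "utf-16-be"),
             (b"\xff\xfe", "utf-16-le")]
    for mark, name in marks:
        if data.startswith(mark):
            return name
    return None

assert sniff_bom("\ufeffhi".encode("utf-16-be")) == "utf-16-be"
assert sniff_bom(b"plain ascii") is None
```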
. when not at the beginning of a text stream, U+FEFF
should normally not occur.
For backwards compatibility it should be treated as
ZERO WIDTH NO-BREAK SPACE (ZWNBSP),
and is then part of the content of the file or string.
The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP
for expressing word joining semantics
since it cannot be confused with a BOM.
When designing a markup language or data protocol,
the use of U+FEFF can be restricted to that of Byte Order Mark.
In that case, any FEFF occurring in the middle of a file
can be treated as an unsupported character (a zero-width invisible character).
. if you are sending a string to another node on the net,
or an unfamiliar app on your node,
consider the BOM-related values {0,FFFE#16, 0,FFFF#16} invalid:
remove them from everywhere but the beginning of the string .
. the use of this reserve for BOM implies
all these are noncharacters:
U+{00 ... 10}{FF,FE ... FF,FF}

. values in 0,D800 #16 .... 0,DFFF #16
are specifically reserved for use with UTF-16,
and don't have any characters assigned to them.
. { 0,FF,FE#16, 0,FF,FF#16 } are noncharacters; FF,FE is excluded precisely
to preserve the usefulness of FE,FF as a BOM (byte-order mark)
--. finding either FE,FF or FF,FE at the head of a char'stream
is also a hint that it's unicode;
though the BOM can also be followed by a "UTF-16" tag:
eg,Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 08 DF 45 00 3D 00 52 ... ;
Little-endian text labelled with UTF-16, with a BOM:
FF FE 08 D8 45 DF 3D 00 52 00 ... .
. the tags { "UTF-16BE"(big-endian), "UTF-16LE"(little-endian) }
can be used in place of the BOM;
-- if the BOM is not at the beginning of the stream,
its meaning changes to a [zero-width non-breaking space] .
. FDD0...FDEF #16 represent noncharacters.
Unpaired surrogates are invalid as well,
i.e. any value in D800 ... DBFF #16
not followed by a DC00 ... DFFF #16,
or any value in DC00 ... DFFF #16
not preceded by a D800 ... DBFF #16 .

how unicode characters are encoded in UTF-16:

. for (values <= 10,FFFF#16):
value < 1,00,00#16 ?
value may be contained in a single 16-bit integer .
value in 1,0000#16 ... 10,FFFF#16 ?
value - 1,0000#16 ( 0,yyyy,yyyy,yyxx,xxxx,xxxx #2 )
is returned in 2 words:
word#1 = 1101,10yy,yyyy,yyyy #2
word#2 = 1101,11xx,xxxx,xxxx #2

encoding ISO 10646 as UTF-16(U <= 0x10FFFF):
U < 0x10000 ?
return U as a 16-bit unsigned integer .
Let U' = U - 0x10000.
-- assert U <= 010,FFFF#16, implying U' <= 0F,FFFF#16 (fitting in 20 bits)
word#1`= 0xD800 -- assert word#1 = 1101,10yy,yyyy,yyyy
word#2`= 0xDC00 -- assert word#2 = 1101,11xx,xxxx,xxxx
word#1`[10 low-order bits]`= [U']`[10 high-order bits]
word#2`[10 low-order bits]`= [U']`[10 low-order bits]
return (word#1, word#2) .
Graphically, steps 2 through 4 look like:
U' = yyyy,yyyy,yyxx,xxxx,xxxx
W1 = 1101,10yy,yyyy,yyyy
W2 = 1101,11xx,xxxx,xxxx
U+D800 ... U+DFFF = 1101,1xxx,xxxx,xxxx
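. the encoding steps above, sketched in python (names are mine):

```python
# encode a code point (<= 10,FFFF#16) as one or two UTF-16 words:
def utf16_encode(u: int):
    assert u <= 0x10FFFF
    if u < 0x10000:
        return (u,)                  # a single 16-bit word
    up = u - 0x10000                 # U' fits in 20 bits
    w1 = 0xD800 | (up >> 10)         # 1101,10yy,yyyy,yyyy
    w2 = 0xDC00 | (up & 0x3FF)       # 1101,11xx,xxxx,xxxx
    return (w1, w2)

assert utf16_encode(0x003D) == (0x003D,)
assert utf16_encode(0x1D11E) == (0xD834, 0xDD1E)
```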

decoding ISO 10646 UTF-16 (U):
. word#1 not in 0xD800 ... 0xDFFF ?
return word#1
word#1 not in 0xD800 ... 0xDBFF ?
return error pointer to the value of W1
word#2 null or not in 0xDC00 ... 0xDFFF?
return error pointer to the value of W1
return ( (word#1`[10 low-order bits] & word#2`[10 low-order bits])
+ 0x10000
) .
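. the decoding steps above, sketched in python (names are mine); unpaired surrogates raise errors instead of returning error pointers:

```python
# decode one or two UTF-16 words back to a code point:
def utf16_decode(w1: int, w2=None) -> int:
    if not 0xD800 <= w1 <= 0xDFFF:
        return w1                    # an ordinary BMP value
    if w1 > 0xDBFF:
        raise ValueError("lone low surrogate")
    if w2 is None or not 0xDC00 <= w2 <= 0xDFFF:
        raise ValueError("high surrogate without a low surrogate")
    # concatenate the two 10-bit payloads, then add back the offset:
    return (((w1 & 0x3FF) << 10) | (w2 & 0x3FF)) + 0x10000

assert utf16_decode(0x0052) == 0x52
assert utf16_decode(0xD834, 0xDD1E) == 0x1D11E
```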

Security Considerations:
Text in UTF-16 may contain special characters,
such as the [object replacement character] (0xFFFC),
that might cause external processing,
depending on the interpretation of the processing program
and the availability of an external data stream .
This external processing may have side-effects
that allow the sender of a message to attack the receiving system.

Implementors of UTF-16 need to consider
the security aspects of how they handle illegal UTF-16 sequences
(that is, sequences involving surrogate pairs
that have illegal values or unpaired surrogates).
It is conceivable that in some circumstances
an attacker would be able to exploit an incautious UTF-16 parser
by sending it an octet sequence that is not permitted by the UTF-16 syntax,
causing it to behave in some anomalous fashion.

UTF-16 extends UCS-2 by using 2 words;
but not in the same way that UCS-4 does .

. UTF-16 and UTF-32 are incompatible with ASCII files,
because they contain many nul bytes,
implying the strings cannot be manipulated by normal C string handling .
Therefore most UTF-16 systems such as Windows and Java
represent text objects such as program code with 8-bit encodings
(ASCII, ISO-8859-1, or UTF-8), not UTF-16.
This introduces a serious complication in programming
that is often overlooked by system designers:
many 8-bit encodings (in particular UTF-8)
can contain invalid sequences that cannot be translated to UTF-16
and thus the file can contain a superset of the valid data.
For instance a UTF-8 URL can name a location
that cannot correspond to a file on the system,
or two different files may compare identical,
or reading and writing a file can change it.
One of the few counterexamples of a UTF-16 file is the "strings" file
used by Mac OS 10.3+ applications
for lookup of internationalized versions of messages,
these default to UTF-16
and "(files encoded using UTF-8 are not guaranteed to work.
When in doubt, encode the file using UTF-16).
Oddly enough, OSX is not a UTF-16 system.
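. sketched in python -- "reading and writing a file can change it": an 8-bit blob holding an invalid sequence cannot round-trip through a lossy decode:

```python
# an invalid UTF-8 sequence gets replaced with U+FFFD on a lossy
# decode, so re-encoding no longer reproduces the original bytes:
raw = b"/\xc0\xae./"                          # the overlong "." from 9.14
text = raw.decode("utf-8", errors="replace")  # U+FFFD substituted in
assert text.encode("utf-8") != raw            # the "same" file changed
```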

"Unicode" was originally BMP (U+0000 ... U+FFFF
-- Basic Multilingual Plane).
When it became clear that more than 64k characters would be needed,
Unicode was turned into a sort of 21-bit character set
range U-0000,0000 ... U-0010,FFFF .
The 2*1024 surrogate characters (U+D800 ... U+DFFF)
were introduced into the BMP to allow
1024*1024 non-BMP characters to be represented as
a sequence of two 16-bit surrogate characters.
This way UTF-16 was born,
which represents the extended "21-bit" Unicode
in a way backwards compatible with UCS-2.
The term UTF-32 was introduced in Unicode to describe
a 4-byte encoding of the extended "21-bit" Unicode.
. UTF-32 is the exact same thing as UCS-4,
except that by definition
UTF-32 is never used to represent characters above U-0010FFFF,
while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF.
The ISO 10646 working group has agreed to modify their standard to exclude
code positions beyond U-0010,FFFF,
in order to turn the new UCS-4 and UTF-32 into practically the same thing.

. some phrases can be expressed in more than one way in ISO 10646/Unicode.
For example, some accented characters can be represented as a single character
(with the accent) and also as a set of characters
(e.g., the base character plus a separate composing accent).
These two forms may appear identical.
There's also a zero-width space,
with the result that apparently-identical strings can compare different.
Beware of situations where such hidden text could interfere with the program.
This is an issue that in general is hard to solve;
most programs don't have such tight control over the clients
that they know completely how a particular sequence will be displayed
(since this depends on the client's font, display characteristics, locale, ...).
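. the accent example, sketched in python: the two spellings look alike but compare unequal until normalized (NFC composes, NFD decomposes):

```python
import unicodedata

composed = "\u00e9"        # e-acute as a single character
decomposed = "e\u0301"     # e + combining acute accent
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```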


In UTF-8, characters from the range U+00,00 ... U+10,FF,FF
(the UTF-16 accessible range)
are encoded using sequences of 1 to 4 octets.
table for converting UTF-32 to UTF-8:
UTF-32 range (hex.) UTF-8 octet sequence (binary)
00... 7F 0xxxxxxx
80... 07,FF 110xxxxx 10xxxxxx
08,00... 0,FF,FF 1110xxxx 10xxxxxx 10xxxxxx
01,00,00... 10,FF,FF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
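. the table above, sketched in python (function name is mine); surrogates and values past 10,FFFF#16 are rejected per RFC 3629:

```python
# encode one UTF-32 value as 1..4 UTF-8 octets:
def utf32_to_utf8(cp: int) -> bytes:
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# cross-check against the built-in encoder:
for cp in (0x24, 0x00A2, 0x20AC, 0x10348):
    assert utf32_to_utf8(cp) == chr(cp).encode("utf-8")
```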

. byte values that never appear in valid UTF-8:
C0#16 = 1100,0000#2
C1#16 = 1100,0001#2
F5#16 = 1111,0101#2 ... FF#16 = 1111,1111#2
. D8,00 ... DF,FF -- illegal: surrogate code values used by UTF-16:
. do UTF-16 -> UTF-32 -> UTF-8 .
This contrasts with CESU-8 [CESU-8],
which is a UTF-8-like encoding
that is not meant for use on the Internet.
CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values
(16-bit quantities) instead of the character number (code point).
This leads to different results for character numbers above 0xFFFF;
the CESU-8 encoding of those characters is NOT valid UTF-8.

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters. When encoding in UTF-8 from UTF-16 data, it is necessary
to first decode the UTF-16 data to obtain character numbers, which
are then encoded in UTF-8 as described above.
. thus, it is easy to convert to utf-8;
it is considerably less simple, however,
to validate that a utf-8 code was converted correctly;
. the code positions U+D800 to U+DFFF (UTF-16 surrogates)
as well as U+FFFE and U+FFFF
must not occur in normal UTF-8 or UCS-4 data.
(binary view of illegal chars)
U+D800 1101,1000,0000,0000
U+DFFF 1101,1111,1111,1111
U+FFFE 1111,1111,1111,1110
U+FFFF 1111,1111,1111,1111
. for safety reasons, UTF-8 decoders should treat them like
malformed or over-long sequences .

. the table above says this like so:
. say the 2..4 bytes have bits named bit#1 ... bit#32:
and bit#{ 1, 2 }  ?
assert bit#3 = 0, and byte#2 starts with 10.base2;
bit#{ 4 ... 8} & bit#{ 11 .. 16} --[ should be in range 0080 ... 07FF ]
< 80.base16 ? raise illegal.coding.error
else: and bit#{ 1, 2, 3 } ?
assert bit#4 = 0, and byte#{2,3} starts with 10.base2;
bit#{ 5 ... 8} & bit#{ 11 .. 16} & bit#{ 19 ... 24 }
--[should be in range 0800 ... FFFF ]
< 800.base16 ? raise illegal.coding.error
else: and bit#{ 1, 2, 3, 4} ?
assert bit#5 = 0, and byte#{2,3,4} starts with 10.base2;
bit#{ 6 ... 8} & bit#{ 11 ... 16} & bit#{ 19 ... 24 } & bit#{ 27 ... 32 }
--[should be in range 1,0000 ... 10,FFFF ]
< 1,0000.base16 ? raise illegal.coding.error
> 10,FFFF.base16 ? raise illegal.coding.error
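. the range checks above, sketched in python (names are mine): collect the payload bits, then reject any result below the floor for that length:

```python
# smallest legal value per sequence length (2, 3, or 4 bytes):
FLOOR = {2: 0x80, 3: 0x800, 4: 0x10000}

def check_multibyte(seq: bytes) -> int:
    n = len(seq)
    cp = seq[0] & (0x7F >> n)            # strip the length discriminant
    for b in seq[1:]:
        assert b & 0xC0 == 0x80          # each trailer starts with 10#2
        cp = (cp << 6) | (b & 0x3F)
    if cp < FLOOR[n]:
        raise ValueError("illegal.coding.error: overlong sequence")
    return cp

assert check_multibyte(b"\xc3\xa9") == 0x00E9   # a legal 2-byte char
try:
    check_multibyte(b"\xc0\xae")                # the overlong '.' again
    overlong_passed = True
except ValueError:
    overlong_passed = False
assert not overlong_passed
```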

. among a list of desiderata for any such encoding
is the ability to synchronize a byte stream picked up mid-run,
with less than one character being consumed before synchronization.
. The model for multibyte processing has it that
ASCII does not occur anywhere in a multibyte encoding.
There should be no ASCII code values for any part of
a UTF representation of a character that was not in the
ASCII character set in the UCS representation of the character.
Historical file systems disallow the null byte and the ASCII slash character
as a part of the file name .
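. self-synchronization, sketched in python (function name is mine): a stream picked up mid-character re-syncs by skipping trailer bytes (those matching 10xx,xxxx#2):

```python
# skip the continuation bytes of at most one partial character:
def resync(stream: bytes) -> bytes:
    i = 0
    while i < len(stream) and stream[i] & 0xC0 == 0x80:
        i += 1
    return stream[i:]

full = "\u00e9\u20ac".encode("utf-8")    # C3,A9, E2,82,AC #16
assert resync(full[1:]) == "\u20ac".encode("utf-8")
```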

1) zero the 4 octets (bytes) of the UCS-4 character
2) Determine which bits encode the character value
from the number of octets in the sequence
and the second column of the table above (the bits marked x).
3) Distribute the bits from the sequence to the UCS-4 character,
first the lower-order bits from the last octet of the sequence and
proceeding to the left until no x bits are left.
If the UTF-8 sequence is no more than three octets long,
decoding can proceed directly to UCS-2.

. UCS (Universal Character Set) is defined by ISO/IEC 10646
as having multiple octets (bytes) .
. UTF-8 (UCS transformation format -- 8bit)
is a way of staying compatible with single-byte ASCII
by using a 1...4-byte variable-length for describing
the same character set as the 4-byte UCS .

. CESU-8 (8-bit Compatibility Encoding Scheme for UTF-16)
is intended for internal use within systems processing Unicode
in order to provide an ASCII-compatible 8-bit encoding
that is similar to UTF-8 but preserves UTF-16 binary collation.
It is not intended nor recommended as an encoding used for open information exchange.
CESU-8 Bit Distribution
UTF-16 Code byte#1 byte#2 byte#3
0000,0000,0xxx,xxxx 0xxxxxxx
0000,0yyy,yyxx,xxxx 110yyyyy 10xxxxxx
zzzz,yyyy,yyxx,xxxx 1110zzzz 10yyyyyy 10xxxxxx

important features of this encoding form:
The CESU-8 representation of characters on the Basic Multilingual Plane (BMP)
is identical to the representation of these characters in UTF-8.
Only the representation of supplementary characters differs.
Only the six-byte form of supplementary characters is legal in CESU-8;
the four-byte UTF-8 style supplementary character sequence is illegal.
A binary collation of data encoded in CESU-8
is identical to the binary collation of the same data encoded in UTF-16.
As a very small percentage of characters in a typical data stream
are expected to be supplementary characters,
there is a strong possibility that CESU-8 data may be misinterpreted as UTF-8.
all use of CESU-8 outside closed implementations is strongly discouraged,
such as emitting CESU-8 in output files,
markup languages, or other open transmission forms.
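. the six-byte supplementary form, sketched in python (function name is mine): CESU-8 pushes each UTF-16 surrogate through the 3-byte UTF-8 pattern:

```python
# CESU-8 form of one supplementary character (code point > FFFF#16):
def cesu8_supplementary(cp: int) -> bytes:
    assert cp > 0xFFFF
    up = cp - 0x10000
    words = (0xD800 | up >> 10, 0xDC00 | up & 0x3FF)   # surrogate pair
    out = bytearray()
    for w in words:                    # 3-byte UTF-8 pattern per word
        out += bytes([0xE0 | w >> 12,
                      0x80 | w >> 6 & 0x3F, 0x80 | w & 0x3F])
    return bytes(out)

enc = cesu8_supplementary(0x1D11E)
assert len(enc) == 6                               # six-byte form
assert enc != "\U0001d11e".encode("utf-8")         # not valid UTF-8
```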

converting {binary, decimal, hex}:
1000 8 8
1001 9 9
1010 10 A
1011 11 B
1100 12 C
1101 13 D
1110 14 E
1111 15 F

. for more examples, see wikipedia:

(binary view of illegal chars)
U+D800 ... U+DFFF = 1101,1xxx,xxxx,xxxx
these are for the 2 words in UTF-16:
word#1 = 1101,10yy,yyyy,yyyy
word#2 = 1101,11xx,xxxx,xxxx

U+FFFE 1111,1111,1111,1110
U+FFFF 1111,1111,1111,1111

see Unicode Technical Report #36, Unicode Security Considerations,
and Unicode Technical Standard #39, Unicode Security Mechanisms.