Americium Dream Documents: booking unicode

9.13: Validate input# character encoding:

[12.29:

. this is an expansion of

Wheeler's how-to on secure programming

]

[9.14:

. staying in control of character encoding

can determine your effectiveness in filtering web content

for preventing cross-site scripting and semantic attacks .

]

when needing to process older documents

in various older (language-specific) character sets,

you need to ensure that an untrusted user cannot control

the setting of another document's character set .

[9.15:

. this page will explain how the various char'encodings can be tricky,

and how mistranslation can affect security .

. UCS (Universal Character Set) is the new standard (subsuming ASCII);

it's the function { integer -> image } for all the images of the world,

requiring more than 2-bytes and less than 4-bytes .

. UTF (UCS Transformation Format) is the generic name for

how to code the UCS -- there are several approaches:

the efficient and backward-compatible way is to use a variant record;

eg, UTF-8:

variant#1 is ascii in a single byte;

variant#{2..4} can be contained in {2..4} bytes, respectively .

. alt'ly, UTF-32 is simply a 4-byte integer

that maps directly to UCS values -- no decoding is needed .

. for handling international characters UTF-8 has become the standard;

. an attacker can exploit an illegal UTF-8 value

by sneaking metacharacters into the string

which can turn parameter names into arbitrary commands .

9.14: 9.15:

. for instance, the wrong way to decode the invalid UTF-8 value

0,C0,80 #16 = (110)0,0000,(10)00,0000 #2

is to remove all the discriminant bits (shown in parentheses),

and return the remaining bits (00#16);

because, the discriminant"(110) was telling you

the remaining bits should have been within 0080 ... 07FF #16;

if the result wasn't in that range, the value is illegal,

and the translator should warn the client .
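
. the difference between the wrong way and the right way can be sketched in a few lines of Python (the language and the function names are my assumptions for illustration; only the 2-byte variant is covered):

```python
def naive_decode_2byte(b1, b2):
    """The WRONG way: strip the discriminant bits, return whatever is left."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


def strict_decode_2byte(b1, b2):
    """Strip the discriminant bits, then insist the value really needed 2 bytes."""
    if (b1 & 0xE0) != 0xC0 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    value = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
    if value < 0x80:  # discriminant (110) promises 0080 ... 07FF
        raise ValueError("overlong encoding of U+%04X" % value)
    return value


print(hex(naive_decode_2byte(0xC0, 0x80)))   # 0x0 -- a NUL slipped through
print(hex(strict_decode_2byte(0xC3, 0xA9)))  # 0xe9 -- a legal 2-byte value
try:
    strict_decode_2byte(0xC0, 0x80)
except ValueError as err:
    print("rejected:", err)
```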

9.15:

. another example relevant to security

would be filtering to prohibit the char.string that

give a unix filename reference to the parent directory "/../":

/ . . /

2F,2E, 2E,2F#16 . weak parsing could still be fooled by this:

2F,C0AE,2E,2F #16 . the same string with the first 2E replaced by C0AE .

. then the client might misparse this invalid UTF-8 to: 2E#16

when in fact it is an illegal value .

. C0 AE = (110)0,0000,(10)10,1110 -utf-8-> 10,1110#2 = 2E#16

(illegal value because for discriminant (110)

the value should be at least 80#16, and 2E is not ) .
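
. one hedged way to close this hole is to decode strictly first and filter second; the sketch below assumes Python, whose built-in UTF-8 decoder rejects C0 AE, and the name has_parent_ref is made up for illustration:

```python
def has_parent_ref(raw):
    """True if the path bytes are unsafe: malformed UTF-8, or containing "/../"."""
    try:
        text = raw.decode("utf-8")  # Python's decoder is strict by default
    except UnicodeDecodeError:
        return True                 # treat undecodable input as hostile
    return "/../" in text


attack = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])
print(b"/../" in attack)       # False: a byte-level filter sees nothing wrong
print(has_parent_ref(attack))  # True: the strict decoder refuses C0 AE
```

. the point of the ordering is that the filter only ever sees text that one strict decoder has already accepted .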

9.15:

. another complication is that

there are some obsolete formats to watch for:

not only does the UTF-8 conversion have to check that it's

holding the correct range of UCS values,

it also has to check for the format being UTF-16:

convert UCS-2 to UTF-8:

http://www.ietf.org/rfc/rfc2279.txt

. UCS-2 is just the low-order 2 bytes of 4-byte UCS (UCS-4);

so the UCS conversion can be used by

extending each UCS-2 character with two zero-valued bytes: 00,00,xx,xx#16 ;

however,

pairs of UCS-2 values between D800 and DFFF need special treatment:

because, they are "(surrogate pairs)

being actually UCS-4 characters transformed through UTF-16:

. convert the UTF-16 to UTF-32,

then convert the UTF-32 to UTF-8 .
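
. as a sketch of that two-step route, Python's codecs can stand in for the conversion (an assumption on my part; the function name is hypothetical): decoding UTF-16 pairs the surrogates into code points, and encoding writes those code points as UTF-8 .

```python
def utf16be_to_utf8(raw):
    """UTF-16 (big-endian) -> code points -> UTF-8."""
    text = raw.decode("utf-16-be")  # pairs D800..DBFF with DC00..DFFF
    return text.encode("utf-8")


# D808 DF45 is a surrogate pair for U+12345; 003D is "="
sample = bytes.fromhex("D808DF45003D")
print(utf16be_to_utf8(sample).hex())  # f0928d853d

try:
    utf16be_to_utf8(bytes.fromhex("D808003D"))  # high surrogate with no partner
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```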

]

9.15: control codes:

00 .. 1F -- C0 set

80 .. 9F -- C1 set

for compatibility with the ISO/IEC 2022 framework

-- 7F (delete) is also a sort of control char .
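
. a small Python check for these ranges might look like this (a sketch; the function name is an assumption):

```python
def is_control_code(cp):
    """True for the C0 set, DEL (7F), and the C1 set."""
    return cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F


print(is_control_code(0x1B))  # True  (ESC, in the C0 set)
print(is_control_code(0x85))  # True  (in the C1 set)
print(is_control_code(0x41))  # False ("A")
```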

9.15: BOM and unicode signature U+FEFF:

. a BOM (byte order mark) is the value U+FEFF

prepended to the unicode char.string,

thereby acting as the signature of a Unicode.string

and indicating the byte ordering:

. here is the format of U+FEFF per UTF form and byte ordering:

00 00 FE FF UTF-32, big-endian

FF FE 00 00 UTF-32, little-endian

FE FF UTF-16, big-endian

FF FE UTF-16, little-endian

EF BB BF UTF-8

-- the UTF-8 format is byte-oriented,

so it is unaffected by byte ordering .
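
. the table can be turned into a small sniffing routine; this is a Python sketch under my own naming (BOMS, sniff_bom) . note the ordering assumption: FF FE 00 00 is tested before FF FE, so a UTF-16 little-endian stream whose first character is U+0000 would be misread as UTF-32 .

```python
BOMS = [
    (b"\x00\x00\xFE\xFF", "UTF-32, big-endian"),
    (b"\xFF\xFE\x00\x00", "UTF-32, little-endian"),  # must precede FF FE
    (b"\xEF\xBB\xBF", "UTF-8"),
    (b"\xFE\xFF", "UTF-16, big-endian"),
    (b"\xFF\xFE", "UTF-16, little-endian"),
]


def sniff_bom(data):
    """Return (form, BOM length) for a leading U+FEFF, or (None, 0)."""
    for bom, form in BOMS:
        if data.startswith(bom):
            return form, len(bom)
    return None, 0


print(sniff_bom(b"\xFF\xFE=\x00"))   # ('UTF-16, little-endian', 2)
print(sniff_bom(b"\xEF\xBB\xBFhi"))  # ('UTF-8', 3)
print(sniff_bom(b"plain"))           # (None, 0)
```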

. when not at the beginning of a text stream, U+FEFF

should normally not occur.

For backwards compatibility it should be treated as

ZERO WIDTH NON-BREAKING SPACE (ZWNBSP),

and is then part of the content of the file or string.

The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP

for expressing word joining semantics

since it cannot be confused with a BOM.

When designing a markup language or data protocol,

the use of U+FEFF can be restricted to that of Byte Order Mark.

In that case, any FEFF occurring in the middle of a file

can be treated as an unsupported character (zero-width invisible characters)

. invalid for interchange

. if you are sending a string to another node on the net,

or an unfamiliar app on your node,

consider the BOM-related values {0,FFFE#16, 0,FFFF#16} invalid:

remove them from everywhere but the beginning of the string .

. the use of this reserve for BOM implies

all these are non-numbers:

www.unicode.org/versions/Unicode5.0.0/ch02.(pdf) (page 27)

U+{00... 10}{FF,FE... FF,FF}

[9.14:] UTF-16

. values in 0,D800 #16 .... 0,DFFF #16

are specifically reserved for use with UTF-16,

and don't have any characters assigned to them.

. 0,FF,FE#16 is not a Unicode char precisely to preserve

the usefulness of 0,FE,FF#16 as a BOM (byte-order mark):

a reader that finds FF,FE#16 first knows the bytes are swapped .

--. either byte order prepended to a char'stream

is also a hint that it's unicode;

though the BOM can also be followed by a "UTF-16" tag:

eg,Big-endian text labelled with UTF-16, with a BOM:

FE FF D8 08 DF 45 00 3D 00 52 ... ;

Little-endian text labelled with UTF-16, with a BOM:

FF FE 08 D8 45 DF 3D 00 52 00 ... .

. the tags { "UTF-16BE"(big-endian), "UTF-16LE"(little-endian) }

can be used in place of the BOM;

-- if the BOM is not at the beginning of the stream,

its meaning changes to a [zero-width non-breaking space] .

[9.15:

http://unicode.org/faq/utf_bom.html#utf16-7

. FDD0...FDEF #16 represent noncharacters.

Unpaired surrogates are invalid as well,

i.e. any value in D800 ... DBFF #16

not followed by a DC00 ... DFFF #16,

or any value in DC00 ... DFFF #16

not preceded by a D800 ... DBFF #16 .

]
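
. both checks (noncharacters and unpaired surrogates) are short enough to sketch in Python; the function names are assumptions, and the input is taken to be a list of 16-bit words or a code point no larger than 10,FFFF#16 :

```python
def is_noncharacter(cp):
    """FDD0..FDEF, plus the last two values (xxFFFE, xxFFFF) of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def first_unpaired_surrogate(words):
    """Index of the first unpaired surrogate in a list of 16-bit words, or None."""
    i = 0
    while i < len(words):
        w = words[i]
        if 0xD800 <= w <= 0xDBFF:          # high surrogate needs a low one next
            if i + 1 < len(words) and 0xDC00 <= words[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= w <= 0xDFFF:          # low surrogate with nothing before it
            return i
        i += 1
    return None


print(is_noncharacter(0xFFFE), is_noncharacter(0x10FFFF), is_noncharacter(0xFFFD))
print(first_unpaired_surrogate([0x0041, 0xD808, 0xDF45]))  # None
print(first_unpaired_surrogate([0x0041, 0xD808, 0x0042]))  # 1
```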

how unicode characters are encoded in UTF-16


. for (values <= 10,FFFF#16):
value < 1,0000#16 ?
   value may be contained in a single 16-bit integer .
value in 1,0000#16 ... 10,FFFF#16 ?
   value - 1,0000#16 ( yyyy,yyyy,yyxx,xxxx,xxxx #2 )
   is returned in 2 words:
   word#1 = 1101,10yy,yyyy,yyyy #2
   word#2 = 1101,11xx,xxxx,xxxx #2

encoding ISO 10646 as UTF-16 (U <= 0x10FFFF):
U < 0x10000 ?
    return U as a 16-bit unsigned integer .
else:
    let U' = U - 0x10000 .
    -- assert U <= 10,FFFF#16, implying U' <= 0F,FFFF#16 (fitting in 20 bits)
    word#1`= 0xD800  -- assert word#1 = 1101,10yy,yyyy,yyyy
    word#2`= 0xDC00  -- assert word#2 = 1101,11xx,xxxx,xxxx
    word#1`[10 low-order bits]`= [U']`[10 high-order bits]
    word#2`[10 low-order bits]`= [U']`[10 low-order bits]
    return (word#1, word#2) .
graphically, the split of U' into the two words looks like:
    U' = yyyy,yyyy,yyxx,xxxx,xxxx
    W1 = 1101,10yy,yyyy,yyyy
    W2 = 1101,11xx,xxxx,xxxx
. every surrogate word falls in U+D800 ... U+DFFF = 1101,1xxx,xxxx,xxxx
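
. the encoding steps can be sketched in Python (my choice of language; utf16_encode is a made-up name) . the sketch also refuses to encode a surrogate value as if it were a character, which the steps above leave implicit:

```python
def utf16_encode(u):
    """Return the list of 16-bit words that encode code point u in UTF-16."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("outside the UTF-16 accessible range")
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if u < 0x10000:
        return [u]
    u2 = u - 0x10000                      # fits in 20 bits
    word1 = 0xD800 | (u2 >> 10)           # 10 high-order bits
    word2 = 0xDC00 | (u2 & 0x3FF)         # 10 low-order bits
    return [word1, word2]


print([hex(w) for w in utf16_encode(0x12345)])  # ['0xd808', '0xdf45']
print([hex(w) for w in utf16_encode(0x3D)])     # ['0x3d']
```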

decoding ISO 10646 UTF-16 (U):
word#1 not in 0xD800 ... 0xDFFF ?
    return word#1
word#1 not in 0xD800 ... 0xDBFF ?
    return error pointer to the value of word#1
word#2 null or not in 0xDC00 ... 0xDFFF ?
    return error pointer to the value of word#1
else:
    return ( (word#1`[10 low-order bits] & word#2`[10 low-order bits])
             -- "&" concatenates: word#1's bits are the high-order 10
             + 0x10000
           ) .
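
. the decoding steps in the same spirit, as a Python sketch (names are assumptions; an exception stands in for the "error pointer"):

```python
def utf16_decode_one(words, i=0):
    """Decode one character starting at words[i]; return (code point, words used)."""
    w1 = words[i]
    if not 0xD800 <= w1 <= 0xDFFF:
        return w1, 1
    if w1 > 0xDBFF:
        raise ValueError("low surrogate %04X without a high surrogate" % w1)
    if i + 1 >= len(words) or not 0xDC00 <= words[i + 1] <= 0xDFFF:
        raise ValueError("high surrogate %04X without a low surrogate" % w1)
    w2 = words[i + 1]
    # concatenate the two 10-bit halves, then undo the 0x10000 offset
    return (((w1 & 0x3FF) << 10) | (w2 & 0x3FF)) + 0x10000, 2


print(utf16_decode_one([0xD808, 0xDF45]))  # (74565, 2), i.e. U+12345
print(utf16_decode_one([0x003D]))          # (61, 1)
```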

Security Considerations

text in UTF-16 may contain special characters,

such as the [object replacement character] (0xFFFC),

that might cause external processing,

depending on the interpretation of the processing program

and the availability of an external data stream .

This external processing may have side-effects

that allow the sender of a message to attack the receiving system.

Implementors of UTF-16 need to consider

the security aspects of how they handle illegal UTF-16 sequences

(that is, sequences involving surrogate pairs

that have illegal values or unpaired surrogates).

It is conceivable that in some circumstances

an attacker would be able to exploit an incautious UTF-16 parser

by sending it an octet sequence that is not permitted by the UTF-16 syntax,

causing it to behave in some anomalous fashion.

[9.14:] Comparison of Unicode encodings

UTF-16 extends UCS-2 by using 2 words;

but not in the same way that UCS-4 does .

. UTF-16 and UTF-32 are incompatible with ASCII files;

because, they contain many nul bytes;

implying the strings cannot be manipulated by normal C string handling .

Therefore most UTF-16 systems such as Windows and Java

represent text objects such as program code with 8-bit encodings

(ASCII, ISO-8859-1, or UTF-8), not UTF-16.

This introduces a serious complication in programming

that is often overlooked by system designers:

many 8-bit encodings (in particular UTF-8)

can contain invalid sequences that cannot be translated to UTF-16

and thus the file can contain a superset of the valid data.

For instance a UTF-8 URL can name a location

that cannot correspond to a file on the system,

or two different files may compare identical,

or reading and writing a file can change it.
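
. a hedged Python illustration of two of these effects, assuming a system that replaces each undecodable byte with U+FFFD when it converts 8-bit names to its internal form (the file names and the helper are invented for the example):

```python
def to_text_lossy(raw):
    """Decode as UTF-8, replacing each undecodable byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


name_a = b"report\xff.txt"
name_b = b"report\xfe.txt"

print(name_a == name_b)                                 # False: different bytes
print(to_text_lossy(name_a) == to_text_lossy(name_b))   # True: they compare identical
print(to_text_lossy(name_a).encode("utf-8") == name_a)  # False: a round trip changed it
```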

One of the few counterexamples of a UTF-16 file is the "strings" file

used by Mac OS X 10.3+ applications

for lookup of internationalized versions of messages;

these default to UTF-16,

and "files encoded using UTF-8 are not guaranteed to work.

When in doubt, encode the file using UTF-16".

Oddly enough, OSX is not a UTF-16 system.

[9.14:] all about unicode

"Unicode" was originally BMP (U+0000 ... U+FFFF

-- Basic Multilingual Plane).

When it became clear that more than 64k characters would be needed,

Unicode was turned into a sort of 21-bit character set

range U-0000,0000 ... U-0010,FFFF .

The 2*1024 surrogate characters (U+D800 ... U+DFFF)

were introduced into the BMP to allow

1024*1024 non-BMP characters to be represented as

a sequence of two 16-bit surrogate characters.

This way UTF-16 was born,

which represents the extended "21-bit" Unicode

in a way backwards compatible with UCS-2.

The term UTF-32 was introduced in Unicode to describe

a 4-byte encoding of the extended "21-bit" Unicode.

. UTF-32 is the exact same thing as UCS-4,

except that by definition

UTF-32 is never used to represent characters above U-0010FFFF,

while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF.

The ISO 10646 working group has agreed to modify their standard to exclude

code positions beyond U-0010,FFFF,

in order to turn the new UCS-4 and UTF-32 into practically the same thing.

. some phrases can be expressed in more than one way in ISO 10646/Unicode.

For example, some accented characters can be represented as a single character

(with the accent) and also as a set of characters

(e.g., the base character plus a separate composing accent).

These two forms may appear identical.

There's also a zero-width space

with the result that apparently-similar items are considered different.

Beware of situations where such hidden text could interfere with the program.

This is an issue that in general is hard to solve;

most programs don't have such tight control over the clients

that they know completely how a particular sequence will be displayed

(since this depends on the client's font, display characteristics, locale, ...).
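
. for comparisons (as opposed to display), one partial mitigation is to normalize before comparing; the sketch below uses Python's unicodedata module, and both the helper name and the short list of zero-width characters are my assumptions . dropping the joiners U+200C and U+200D changes meaning in some scripts, so this suits identifiers better than prose .

```python
import unicodedata

INVISIBLE = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}  # zero-width characters


def canonical(text):
    """Drop zero-width characters, then compose accents (NFC)."""
    visible = "".join(ch for ch in text if ord(ch) not in INVISIBLE)
    return unicodedata.normalize("NFC", visible)


composed = "caf\u00e9"      # single accented character
decomposed = "cafe\u0301"   # base character plus combining accent
hidden = "ca\u200bf\u00e9"  # zero-width space in the middle

print(composed == decomposed)  # False, though they look the same
print(canonical(composed) == canonical(decomposed) == canonical(hidden))  # True
```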

9.14:

[9.15:] UTF-8, a transformation format of ISO 10646

In UTF-8, characters from the range U+00,00 ... U+10,FF,FF

(the UTF-16 accessible range)

are encoded using sequences of 1 to 4 octets.

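table for converting UTF-32 to UTF-8:
UTF-32 range (hex.)        UTF-8 octet sequence (binary)
00 ... 7F                  0xxxxxxx
80 ... 07,FF               110xxxxx 10xxxxxx
08,00 ... FF,FF            1110xxxx 10xxxxxx 10xxxxxx
01,00,00 ... 10,FF,FF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-- the 4-octet form has room for values up to 1F,FF,FF,
   but anything above 10,FF,FF is not a Unicode value and is illegal .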
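
. read row by row, the table gives an encoder directly; here is a Python sketch (utf8_encode is a hypothetical name, and the surrogate and upper-bound checks come from the rules stated elsewhere in these notes):

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8, one branch per row of the table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x12345).hex())  # f0928d85
```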

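bytes never found in UTF-8:

C0#16 = 1100,0000#2

C1#16 = 1100,0001#2

-- (these two could only begin an over-long 2-byte form of an ASCII value)

F5#16 = 1111,0101#2

... through ...

FF#16 = 1111,1111#2

-- (these would begin values above 10,FF,FF or are not lead bytes at all)

D8,00 ... DF,FF -- illegal as UTF-8 values: surrogate codes reserved for UTF-16:

. do UTF-16 -> UTF-32 -> UTF-8 .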

This contrasts with CESU-8 [CESU-8],

which is a UTF-8-like encoding

that is not meant for use on the Internet.

CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values

(16-bit quantities) instead of the character number (code point).

This leads to different results for character numbers above 0xFFFF;

the CESU-8 encoding of those characters is NOT valid UTF-8.

[9.15:]

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters. When encoding in UTF-8 from UTF-16 data, it is necessary
to first decode the UTF-16 data to obtain character numbers, which
are then encoded in UTF-8 as described above.

. thus, it is easy to convert to utf-8;

it is considerably less simple, however,

to validate that a utf-8 code was converted correctly;

. the code positions U+D800 to U+DFFF (UTF-16 surrogates)

as well as U+FFFE and U+FFFF

must not occur in normal UTF-8 or UCS-4 data.

(binary view of illegal chars)

U+D800 1101,1000,0000,0000

U+DFFF 1101,1111,1111,1111

U+FFFE 1111,1111,1111,1110

U+FFFF 1111,1111,1111,1111

. for safety reasons, UTF-8 decoders should treat them like

malformed or over-long sequences .

. the table above says this like so:

. say the 2..4 bytes have bits named bit#1 ... bit#32:

bit#{ 1, 2 } set and bit#3 = 0 ?
    assert byte#2 starts with 10.base2;
    value`= bit#{ 4 ... 8 } & bit#{ 11 ... 16 }
    --[ should be in range 0080 ... 07FF ]
    value < 80.base16 ? raise illegal.coding.error
else: bit#{ 1, 2, 3 } set and bit#4 = 0 ?
    assert byte#{2,3} start with 10.base2;
    value`= bit#{ 5 ... 8 } & bit#{ 11 ... 16 } & bit#{ 19 ... 24 }
    --[ should be in range 0800 ... FFFF, excluding D800 ... DFFF ]
    value < 800.base16 or value in D800 ... DFFF ? raise illegal.coding.error
else: bit#{ 1, 2, 3, 4 } set and bit#5 = 0 ?
    assert byte#{2,3,4} start with 10.base2;
    value`= bit#{ 6 ... 8 } & bit#{ 11 ... 16 } & bit#{ 19 ... 24 } & bit#{ 27 ... 32 }
    --[ should be in range 1,0000 ... 10,FFFF ]
    value < 1,0000.base16 or value > 10,FFFF.base16 ? raise illegal.coding.error
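
. the same checks as a Python sketch of a validating decoder for one sequence (the name utf8_decode_one and the use of exceptions are my assumptions; the noncharacter test for U+FFFE and U+FFFF is left out to keep it short):

```python
def utf8_decode_one(data, i=0):
    """Decode one UTF-8 sequence at data[i]; return (code point, bytes used)."""
    b0 = data[i]
    if b0 < 0x80:                      # 0xxxxxxx
        return b0, 1
    if 0xC0 <= b0 <= 0xDF:             # 110xxxxx
        n, value, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:           # 1110xxxx
        n, value, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF7:           # 11110xxx
        n, value, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("byte %02X cannot start a sequence" % b0)
    if i + n > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + n]:
        if (b & 0xC0) != 0x80:         # every later byte must be 10xxxxxx
            raise ValueError("byte %02X is not a continuation byte" % b)
        value = (value << 6) | (b & 0x3F)
    if value < minimum:
        raise ValueError("over-long encoding of U+%04X" % value)
    if value > 0x10FFFF:
        raise ValueError("value above 10FFFF")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate U+%04X in UTF-8" % value)
    return value, n


print(utf8_decode_one(b"\xe2\x82\xac"))  # (8364, 3), i.e. U+20AC
```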

9.14: why bloat the code with constants in every byte?

. among a list of desiderata for any such encoding,
is the ability to synchronize a byte stream picked up mid-run,
with less than one character being consumed before synchronization.
. The model for multibyte processing has it that
ASCII does not occur anywhere in a multibyte encoding.
There should be no ASCII code values for any part of
a UTF representation of a character that was not in the
ASCII character set in the UCS representation of the character.
Historical file systems disallow the null byte and the ASCII slash character
as a part of the file name;
so a multibyte character must never contain either byte .
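
. the constant 10 at the top of every continuation byte is what makes resynchronization cheap; a Python sketch (resync is a made-up name):

```python
def resync(data, start):
    """Index of the first byte at or after start that begins a character."""
    i = start
    while i < len(data) and (data[i] & 0xC0) == 0x80:  # 10xxxxxx = continuation
        i += 1
    return i


stream = "a\u20acb".encode("utf-8")  # 61 E2 82 AC 62
print(resync(stream, 2))  # 4: skips 82 AC, the tail of the euro sign
print(resync(stream, 1))  # 1: E2 is already a lead byte
```

. in valid UTF-8 at most three bytes are skipped, which is less than one character .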

convert UTF-8 to UCS-4:

1) zero the 4 octets (bytes) of the UCS-4 character
2) Determine which bits encode the character value
from the number of octets in the sequence
and the second column of the table above (the bits marked x).
3) Distribute the bits from the sequence to the UCS-4 character,
first the lower-order bits from the last octet of the sequence and
proceeding to the left until no x bits are left.
If the UTF-8 sequence is no more than three octets long,
decoding can proceed directly to UCS-2.

[9.13:] UTF-8, a transformation format of ISO 10646

. UCS (Universal Character Set) is defined by ISO/IEC 10646
as having multiple octets (bytes) .
. UTF-8 (UCS transformation format -- 8bit)
is a way of staying compatible with single-byte ASCII
by using a 1...4-byte variable-length code for describing
the same character set as the 4-byte UCS .

an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU)

. CESU-8 (8-bit Compatibility Encoding Scheme for UTF-16)
is intended for internal use within systems processing Unicode
in order to provide an ASCII-compatible 8-bit encoding
that is similar to UTF-8 but preserves UTF-16 binary collation.
It is not intended nor recommended as an encoding used for open information exchange.

CESU-8 Bit Distribution

UTF-16 Code             byte#1    byte#2    byte#3

0000,0000,0xxx,xxxx     0xxxxxxx

0000,0yyy,yyxx,xxxx     110yyyyy  10xxxxxx

zzzz,yyyy,yyxx,xxxx     1110zzzz  10yyyyyy  10xxxxxx

important features of this encoding form:
The CESU-8 representation of characters on the Basic Multilingual Plane (BMP)
is identical to the representation of these characters in UTF-8.
Only the representation of supplementary characters differs.
Only the six-byte form of supplementary characters is legal in CESU-8;
the four-byte UTF-8 style supplementary character sequence is illegal.
A binary collation of data encoded in CESU-8
is identical to the binary collation of the same data encoded in UTF-16.
As a very small percentage of characters in a typical data stream
are expected to be supplementary characters,
there is a strong possibility that CESU-8 data may be misinterpreted as UTF-8.
Therefore,
all use of CESU-8 outside closed implementations is strongly discouraged,
such as the emittance of CESU-8 in output files,
markup language or other open transmission forms.
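
. to see why CESU-8 must not leak into open interchange, here is a Python sketch of an encoder built from the bit distribution above (cesu8_encode is a hypothetical helper, not a standard-library function); a strict UTF-8 decoder rejects its six-byte output:

```python
def cesu8_encode(text):
    """Encode text as CESU-8: each UTF-16 code unit goes through the 1..3-byte rows."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")      # BMP: identical to UTF-8
            continue
        u2 = cp - 0x10000                  # supplementary: split into surrogates
        for w in (0xD800 | (u2 >> 10), 0xDC00 | (u2 & 0x3FF)):
            out += bytes([0xE0 | (w >> 12),
                          0x80 | ((w >> 6) & 0x3F),
                          0x80 | (w & 0x3F)])
    return bytes(out)


six = cesu8_encode("\U00012345")
print(six.hex())                       # eda088edbd85 (six bytes)
print("\U00012345".encode("utf-8").hex())  # f0928d85 (four bytes)
try:
    six.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```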

converting {binary, hex}:

1000 8

1001 9

1010 10 A

1011 11 B

1100 12 C

1101 13 D

1110 14 E

1111 15 F

. for more examples, see wikipedia:

. see Markus Kuhn's UTF-8 decoder verification stress test

http://www.unicode.org/charts/symbols.html

http://unicode.org/faq/utf_bom.html

. C routines converting between various utf's(text.c)

(binary view of illegal chars)

U+D800 ... U+DFFF = 1101,1xxx,xxxx,xxxx

these are for the 2 words in UTF-16:

word#1 = 1101,10yy,yyyy,yyyy

word#2 = 1101,11xx,xxxx,xxxx

U+FFFE 1111,1111,1111,1110

U+FFFF 1111,1111,1111,1111

http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html

todo:

see Unicode Technical Report #36, Unicode Security Considerations,

and Unicode Technical Standard #39, Unicode Security Mechanisms.

2009-12-29
