2012-01-31

which scripts are in what-size utf-8 codes

adde/utf-8/which codes are in how many bytes?
1.5: todo: [done]
. in utf-8, the number of bytes per code point grows with its value:
ascii uses 1 byte (7 bits);
then how are the 2..4-byte codes divided,
along what code blocks? [1.31:
11 bits in 2 bytes -- most euro or compact scripts
16 bits in 3 bytes -- math & most other scripts
21 bits in 4 bytes -- less common CJK & historic scripts
In November 2003 UTF-8 was restricted to U+10FFFF,
in order to match the constraints of the
UTF-16 character encoding.
This removed the 5 and 6-byte sequences,
and about half the 4-byte sequences.]
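
a minimal sketch (python) of the size rule above: the byte count follows from the code point's value alone. `utf8_len` is a hypothetical helper name, not a library call; the loop at the end cross-checks it against python's built-in encoder.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for code point cp (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if cp <= 0x7F:      # fits in 7 bits
        return 1
    if cp <= 0x7FF:     # fits in 11 bits
        return 2
    if cp <= 0xFFFF:    # fits in 16 bits
        return 3
    return 4            # up to 21 bits, capped at U+10FFFF


# cross-check against Python's own encoder:
# A (U+0041), e-acute (U+00E9), n-ary summation (U+2211), double-struck A (U+1D538)
for ch in ("A", "\u00e9", "\u2211", "\U0001D538"):
    assert utf8_len(ord(ch)) == len(ch.encode("utf-8"))
```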
1.6: web:
Bits   Last code point   Byte 1     Byte 2..4
  7    U+007F            0xxxxxxx
 11    U+07FF            110xxxxx   10xxxxxx
--. Latin letters with diacritics and characters from the
-- Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets

 16    U+FFFF            1110xxxx   10xxxxxx *2
--. the rest of the Basic Multilingual Plane
-- (which contains virtually all characters in common use).

 21    U+10FFFF          11110xxx   10xxxxxx *3
--. less common CJK characters and various historic scripts.
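
the bit patterns in the table can be turned into a hand-rolled encoder. this is only a sketch to show where the bits go, assuming a single valid code point as input; `utf8_encode` is a made-up name, and real code should just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])            # 0xxxxxxx
    if cp <= 0x7FF:
        n, lead = 1, 0xC0             # 110xxxxx + 1 continuation byte
    elif cp <= 0xFFFF:
        n, lead = 2, 0xE0             # 1110xxxx + 2 continuation bytes
    else:
        n, lead = 3, 0xF0             # 11110xxx + 3 continuation bytes
    out = [lead | (cp >> (6 * n))]    # top bits go into the lead byte
    for i in range(n - 1, -1, -1):
        # each continuation byte is 10xxxxxx and carries the next 6 bits
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


# the euro sign U+20AC sits in the 16-bit row, so it takes 3 bytes
print(utf8_encode(0x20AC).hex())      # e282ac
```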
1.31: web: where are the math symbols? mostly in 3 bytes (only the last block listed needs 4):
    Mathematical Operators (U+2200–U+22FF)
    Miscellaneous Mathematical Symbols-A (U+27C0–U+27EF)
    Miscellaneous Mathematical Symbols-B (U+2980–U+29FF)
    Supplemental Mathematical Operators (U+2A00–U+2AFF)
    Letterlike Symbols (U+2100–U+214F)
    Ceilings and floors in Miscellaneous technical (U+2308–U+230B)
    Geometric Shapes (U+25A0–U+25FF)
    Arrows in Miscellaneous Symbols and Arrows (U+2B30–U+2B4C)
    Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) -- above U+FFFF, so 4 bytes
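
a quick way to check which size each of these blocks lands in, sketched in python. the ranges are copied from the list above; `MATH_BLOCKS` and `utf8_sizes` are names made up for this note. every block except Mathematical Alphanumeric Symbols should report 3 bytes, since only that one lies above U+FFFF.

```python
# block ranges copied from the list above
MATH_BLOCKS = {
    "Mathematical Operators": (0x2200, 0x22FF),
    "Miscellaneous Mathematical Symbols-A": (0x27C0, 0x27EF),
    "Miscellaneous Mathematical Symbols-B": (0x2980, 0x29FF),
    "Supplemental Mathematical Operators": (0x2A00, 0x2AFF),
    "Letterlike Symbols": (0x2100, 0x214F),
    "Ceilings and floors": (0x2308, 0x230B),
    "Geometric Shapes": (0x25A0, 0x25FF),
    "Arrows (Misc. Symbols and Arrows)": (0x2B30, 0x2B4C),
    "Mathematical Alphanumeric Symbols": (0x1D400, 0x1D7FF),
}


def utf8_sizes(first: int, last: int) -> set:
    """Set of UTF-8 byte lengths seen across first..last (inclusive)."""
    return {len(chr(cp).encode("utf-8")) for cp in range(first, last + 1)}


for name, (first, last) in MATH_BLOCKS.items():
    print(f"{name}: {sorted(utf8_sizes(first, last))} bytes")
```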
