1.5: todo: [done]
. in UTF-8, the number of bytes per code point grows with the code point's value:
ASCII uses 1 byte (7 bits);
then how are the code points divided among 2..4 bytes,
along what code pages? [1.31:
11 bits in 2 bytes -- most European and other compact alphabets
16 bits in 3 bytes -- math & most other languages
21 bits in 4 bytes -- less common CJK characters, historic scripts
In November 2003 (RFC 3629) UTF-8 was restricted to U+10FFFF,
in order to match the constraints of the
UTF-16 character encoding.
This removed the 5- and 6-byte sequences,
and about half the 4-byte sequences.]
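A minimal sketch of the size rule above, in Python. The function name utf8_len is made up for these notes; the thresholds are just the 7/11/16/21-bit limits, and Python's own encoder is used only as a cross-check.

```python
def utf8_len(cp: int) -> int:
    """Return how many bytes UTF-8 needs for code point cp (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if cp <= 0x7F:       # fits in 7 bits  -> 1 byte (ASCII)
        return 1
    if cp <= 0x7FF:      # fits in 11 bits -> 2 bytes
        return 2
    if cp <= 0xFFFF:     # fits in 16 bits -> 3 bytes
        return 3
    return 4             # up to 21 bits, capped at U+10FFFF -> 4 bytes


# Cross-check against Python's built-in encoder for one sample per size.
for ch in "A\u00e9\u2200\U0001d400":
    assert utf8_len(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} needs {utf8_len(ord(ch))} byte(s)")
```

The samples are A (U+0041), é (U+00E9), ∀ (U+2200) and mathematical bold A (U+1D400), one for each byte length.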
1.6: web:
Bits  Last code point  Byte 1    Bytes 2..4
7     7F               0xxxxxxx
11    07FF             110xxxxx  10xxxxxx
--. Latin letters with diacritics and characters from the
-- Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets
16    FFFF             1110xxxx  10xxxxxx *2
--. the rest of the Basic Multilingual Plane
-- (which contains virtually all characters in common use).
21    10FFFF           11110xxx  10xxxxxx *3
--. less common CJK characters and various historic scripts.
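To see how the x bits in the table get filled, here is a hand-rolled encoder sketch in Python. The name encode_utf8 is hypothetical and the code only illustrates the bit layout; in practice str.encode("utf-8") does the same job, and it is used below only for comparison.

```python
def encode_utf8(cp: int) -> bytes:
    """Pack code point cp into UTF-8 bytes following the bit layout above (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with the built-in encoder at the edges of each row of the table.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```

Each continuation byte carries 6 payload bits, which is where the totals 5+6 = 11, 4+6+6 = 16 and 3+6+6+6 = 21 come from.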
1.31: web: where are the math symbols? mostly in 3 bytes:
Mathematical Operators (U+2200–U+22FF)
Miscellaneous Mathematical Symbols-A (U+27C0–U+27EF)
Miscellaneous Mathematical Symbols-B (U+2980–U+29FF)
Supplemental Mathematical Operators (U+2A00–U+2AFF)
Letterlike Symbols (U+2100–U+214F)
Ceilings and floors in Miscellaneous technical (U+2308–U+230B)
Geometric Shapes (U+25A0–U+25FF)
Arrows in Miscellaneous Symbols and Arrows (U+2B30–U+2B4C)
Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF) -- the exception: above FFFF, so 4 bytes
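A quick way to check the byte length of each range in this list is to encode every code point in it and collect the lengths. A sketch in Python: MATH_BLOCKS and utf8_lengths are names invented for these notes, and the ranges are copied from the list.

```python
# Ranges copied from the list above: (first, last) code point, inclusive.
MATH_BLOCKS = {
    "Mathematical Operators": (0x2200, 0x22FF),
    "Miscellaneous Mathematical Symbols-A": (0x27C0, 0x27EF),
    "Miscellaneous Mathematical Symbols-B": (0x2980, 0x29FF),
    "Supplemental Mathematical Operators": (0x2A00, 0x2AFF),
    "Letterlike Symbols": (0x2100, 0x214F),
    "Ceilings and floors": (0x2308, 0x230B),
    "Geometric Shapes": (0x25A0, 0x25FF),
    "Arrows (Misc. Symbols and Arrows)": (0x2B30, 0x2B4C),
    "Mathematical Alphanumeric Symbols": (0x1D400, 0x1D7FF),
}


def utf8_lengths(first: int, last: int) -> set:
    """Set of UTF-8 byte lengths seen across the code points first..last."""
    return {len(chr(cp).encode("utf-8")) for cp in range(first, last + 1)}


for name, (first, last) in MATH_BLOCKS.items():
    lengths = sorted(utf8_lengths(first, last))
    print(f"{name}: U+{first:04X}..U+{last:04X} -> {lengths} byte(s)")
```

Every range at or below U+FFFF comes out as 3 bytes; only the Mathematical Alphanumeric Symbols, which sit above U+FFFF, need 4.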