2009-12-28

unicode as wordcode transport

9.14: adde/unicode/bloated for unix way:
. unicode's UTF-8 is such a hog for space only because
it had to be compatible with the ways of unix programs:
they expect to be able to jump in midstream
and never find a null byte .
. by always tracking size by a length field
rather than a sentinel byte,
the coding system can be more compact .
. for the first 2 bits of a byte:
0x is ascii in one byte
11 is the head byte of a multi-byte utf-8 sequence
10 is available for personal use,
as long as it's kept in binary files
so that unix programs don't think it's a superset of utf;
otherwise,
10 is part of every non-head byte,
so that unix can tell in midstream
whether the current byte is part of a UTF-8 extension
rather than ascii .
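
. a minimal sketch of the 2-bit test described above, in python;
the names classify_byte and resync are made up for illustration,
and the test looks only at the top 2 bits
(it does not check that the bytes form valid UTF-8):

```python
def classify_byte(b: int) -> str:
    """Classify one byte by its two highest bits."""
    top2 = b >> 6
    if top2 in (0b00, 0b01):
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if top2 == 0b11:
        return "head"          # 11xxxxxx: first byte of a multi-byte sequence
    return "non-head"          # 10xxxxxx: a later byte of a sequence


def resync(data: bytes, pos: int) -> int:
    """Jump in midstream: back up to the first byte of the character at pos."""
    while pos > 0 and classify_byte(data[pos]) == "non-head":
        pos -= 1
    return pos


# "a", then the euro sign (3 bytes in UTF-8), then "b".
sample = "a\u20acb".encode("utf-8")
kinds = [classify_byte(b) for b in sample]
```

. landing on any byte of the euro sign, resync backs up to its head byte;
landing on "a" or "b", it stays put .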


adde/unicode/country-coding for efficiency:
. the way to think of unicode's pictographs
is that they are representing whole words of a foreign language,
but that they could represent the words of any language
if only there were a code for changing the language .
. instead of using their space for their words,
each user should use that space for the words of their own language .
[9.15:
. ideally, you would want to keep the meanings the same,
so the Chinese fish character would be replaced with the string "(fish)" .
but as is so often the case with language translation,
many word definitions depend on the other words they're combined with .
. better to save UTF-8 for external communications,
and then organize the lang so that
the most common parts of the language
are described in the first 16 bits of the code .
]
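
. one way to put the most common parts of a language in the first 16 bits
is to rank words by how often they occur and hand out codes in that order .
a minimal sketch in python, assuming a plain whitespace-separated corpus;
build_table and its size parameter are hypothetical names:

```python
from collections import Counter


def build_table(corpus: str, size: int = 1 << 16) -> list[str]:
    """Return the most frequent words; a word's index is its code."""
    counts = Counter(corpus.split())
    # Most frequent first; ties broken alphabetically so the table is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]   # at most 2**16 entries fit in a 16-bit code


table = build_table("the fish saw the other fish and the sea")
```

. the word seen most often gets code 0, the next gets code 1, and so on;
anything that doesn't make the cut has to be spelled out .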
. with today's large memories,
it's possible to have an array indexed by integer code
that holds the strings of the most common words,
thereby compacting words down to 2..3 bytes each .
. the size of text files would be reduced by 5..10 times
unless the text used a lot of personal names
which have to be spelled out .
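
. a minimal sketch of such a word array in python,
under loudly stated assumptions: the 5-word table, the 2-byte code size,
and the ESCAPE code for spelled-out words are all made up for illustration;
spelled-out words carry a length field rather than a sentinel byte:

```python
import struct

# Hypothetical word array: index = code, value = word.
# A real table would hold tens of thousands of common words.
WORDS = ["the", "fish", "is", "in", "sea"]
CODE_OF = {w: i for i, w in enumerate(WORDS)}
ESCAPE = 0xFFFF   # hypothetical code meaning "a spelled-out word follows"


def encode(text: str) -> bytes:
    out = bytearray()
    for word in text.split():
        code = CODE_OF.get(word)
        if code is not None:
            out += struct.pack(">H", code)   # 2 bytes for a common word
        else:
            raw = word.encode("utf-8")
            # Escape code, 1-byte length, then the word itself.
            # Sketch limit: a spelled-out word must fit in 255 bytes.
            out += struct.pack(">HB", ESCAPE, len(raw)) + raw
    return bytes(out)


def decode(data: bytes) -> str:
    words, pos = [], 0
    while pos < len(data):
        (code,) = struct.unpack_from(">H", data, pos)
        pos += 2
        if code == ESCAPE:
            n = data[pos]
            words.append(data[pos + 1:pos + 1 + n].decode("utf-8"))
            pos += 1 + n
        else:
            words.append(WORDS[code])
    return " ".join(words)


packed = encode("the fish is in the sea")
```

. here the 22-byte sentence packs into 12 bytes;
a name such as "Nemo" costs 3 bytes of overhead plus its spelling,
which is why texts full of personal names compress less .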
. a file can also locally define free unicodes
as representing repeated phrases and nouns .
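
. a minimal sketch of file-local definitions in python,
assuming "free unicodes" can be taken from Unicode's Private Use Area
(U+E000 and up), and assuming the text doesn't already use those code points;
compress and expand are hypothetical names:

```python
PUA_START = 0xE000   # first code point of Unicode's Private Use Area


def compress(text: str, phrases: list[str]) -> tuple[dict[str, str], str]:
    """Replace each listed phrase with one private-use character."""
    table = {}
    for i, phrase in enumerate(phrases):
        ch = chr(PUA_START + i)
        table[ch] = phrase            # the file would store this table up front
        text = text.replace(phrase, ch)
    return table, text


def expand(table: dict[str, str], body: str) -> str:
    """Undo compress by substituting each phrase back in."""
    for ch, phrase in table.items():
        body = body.replace(ch, phrase)
    return body


original = "the United Nations met; the United Nations voted"
table, body = compress(original, ["the United Nations"])
```

. the table travels with the file, so the codes mean nothing outside it;
that keeps them from colliding with anyone else's private definitions .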
