Americium Dream Documents: unicode as wordcode transport

2009-12-28

9.14: adde/unicode/bloated for unix way:

. unicode's UTF-8 encoding is such a hog for space only because

it had to be compatible with the ways of unix programs:

they expect to be able to jump in midstream

and never find a null byte inside a character .

. by always tracking size by a length field

rather than a sentinel byte,

the coding system can be more compact .
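
. as a minimal sketch of the length-field idea (not a format this post defines):
each record is prefixed with its byte count,
so the payload may contain any byte at all, including null .
the names pack_lengthed and unpack_lengthed, and the 4-byte big-endian length,
are assumptions made for illustration .

```python
import struct

def pack_lengthed(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length (hypothetical framing)."""
    return struct.pack(">I", len(payload)) + payload

def unpack_lengthed(buf: bytes, offset: int = 0):
    """Return (payload, next_offset); the payload may contain 0x00 bytes."""
    (n,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + n], start + n

# two records back to back; the first one contains a null byte
record = pack_lengthed(b"ab\x00cd") + pack_lengthed(b"xyz")
first, pos = unpack_lengthed(record)
second, pos = unpack_lengthed(record, pos)

# a sentinel-terminated reader would have stopped early at the null byte
sentinel_view = b"ab\x00cd".split(b"\x00")[0]
```

. the trade-off is that a reader can no longer jump in midstream:
it has to walk from one length field to the next .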

. for the first 2 bits of each byte:

0x is ascii in one byte

11 is the head byte of a multi-byte utf-8 character

10 is available for personal use,

as long as it's kept in binary files

so that unix programs don't think it's a superset of utf;

otherwise,

10 is part of every non-head byte,

so that unix can tell in midstream

whether the current byte is part of a UTF-8 extension

rather than ascii .
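
. the 2-bit rule above can be sketched as follows;
byte_kind and resync are hypothetical helper names,
and the check looks only at the top 2 bits as described here
(real UTF-8 also rejects a few byte values that this sketch would call "head") .

```python
def byte_kind(b: int) -> str:
    """Classify one byte by its top 2 bits."""
    top2 = b >> 6
    if top2 in (0b00, 0b01):
        return "ascii"         # 0x: a complete one-byte character
    if top2 == 0b11:
        return "head"          # 11: first byte of a multi-byte character
    return "continuation"      # 10: a non-head byte

def resync(data: bytes, pos: int) -> int:
    """From an arbitrary midstream position, skip forward to a character start."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos

data = "a€b".encode("utf-8")           # b'a\xe2\x82\xacb'
kinds = [byte_kind(b) for b in data]
# kinds is ['ascii', 'head', 'continuation', 'continuation', 'ascii']

start = resync(data, 2)                # landing inside the euro sign moves us to index 4
```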

adde/unicode/country-coding for efficiency:

. the way to think of unicode's pictographs

is that they are representing whole words of a foreign language,

but that they could represent the words of any language

if only there were a code for changing the language .

. instead of using their space for their words,

each user should use that space for the words of their own language .

[9.15:

. ideally, you would want to keep the meanings the same,

so the chinese fish character would be replaced with the string "fish" .

but as is so often the case with language translation

many word definitions depend on the other words they're combined with .

. better to save UTF-8 for external communications,

and then organize the lang so that

the most common parts of the language

are described in the first 16 bits of the code .

]

. with today's large memories,

it's possible to have an array, indexed by integer code,

that holds the strings of the most common words,

thereby compacting words down to 2..3 bytes each .

. the size of text files would be reduced by about 2..3 times for english, where a word and its space average about 6 bytes,

unless the text used a lot of personal names

which have to be spelled out .
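
. a minimal sketch of such a word table, under stated assumptions:
the 5-word COMMON_WORDS list, the 16-bit codes, the ESCAPE value 0xFFFF,
and the 1-byte length for spelled-out words are all made up for illustration;
a real table would hold tens of thousands of words .
whitespace is normalized to single spaces .

```python
import struct

COMMON_WORDS = ["the", "of", "and", "to", "fish"]   # hypothetical tiny table
CODE_OF = {word: code for code, word in enumerate(COMMON_WORDS)}
ESCAPE = 0xFFFF   # assumed marker: a spelled-out word follows

def encode_words(text: str) -> bytes:
    out = bytearray()
    for word in text.split():
        code = CODE_OF.get(word)
        if code is not None:
            out += struct.pack(">H", code)           # common word: 2 bytes
        else:
            raw = word.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("word too long for a 1-byte length field")
            # rare word or name: escape code, length field, then the spelling
            out += struct.pack(">HB", ESCAPE, len(raw)) + raw
    return bytes(out)

def decode_words(buf: bytes) -> str:
    words, pos = [], 0
    while pos < len(buf):
        (code,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if code == ESCAPE:
            n = buf[pos]
            pos += 1
            words.append(buf[pos:pos + n].decode("utf-8"))
            pos += n
        else:
            words.append(COMMON_WORDS[code])
    return " ".join(words)

text = "the fish and the name Zaphod"
packed = encode_words(text)
restored = decode_words(packed)
```

. note that each spelled-out word costs 3 bytes more than its plain spelling,
which is why a text full of personal names gains little .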

. a file can also locally define free unicodes

as representing repeated phrases and nouns .
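
. one way to sketch file-local definitions is to take "free unicodes"
to mean the Private Use Area that starts at U+E000 (an assumption; the post does not say which codes it has in mind) .
compress_phrases and expand_phrases are hypothetical names,
and the sketch assumes the text does not already contain private-use characters .

```python
PUA_START = 0xE000   # first code point of the BMP Private Use Area

def compress_phrases(text: str, phrases: list[str]):
    """Return (table, body): the file-local definitions and the shortened text."""
    table = {chr(PUA_START + i): phrase for i, phrase in enumerate(phrases)}
    body = text
    for ch, phrase in table.items():
        body = body.replace(phrase, ch)
    return table, body

def expand_phrases(table: dict, body: str) -> str:
    """Undo compress_phrases using the table stored with the file."""
    for ch, phrase in table.items():
        body = body.replace(ch, phrase)
    return body

text = "Americium Dream Documents are notes; Americium Dream Documents are long"
table, body = compress_phrases(text, ["Americium Dream Documents"])
restored = expand_phrases(table, body)
```

. the table has to travel with the file, so the saving only pays off
for phrases that repeat often enough to cover the cost of defining them .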
