2012-02-28

wordcoding and code paging

2.12: adda/wordcode/
code page for programming symbols:

. in some places where cocoa uses strings,
what it really means to use is symbols;
a 4-byte number would do for those:
each library has a 16-bit serial number;
it may contain a 16-bit number of symbols .
. english words will be encoded too,
so maybe both could be part of the same coding system?
or be modular, part of its own code page .
. the programming symbols can be used alone
or the english words can embed it by saying
this is jargon from computing:
the next 4bytes belong to it .
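
. a minimal sketch of the 4-byte symbol code,
assuming the library serial number comes first
and that both fields are stored big-endian
(the byte order and the names pack_symbol, unpack_symbol
are assumptions, not part of the notes above):

```python
import struct

def pack_symbol(library_serial, symbol_index):
    """Pack a (library, symbol) pair into 4 bytes.

    Assumed layout: 16-bit library serial, then 16-bit symbol index,
    both big-endian.
    """
    if not (0 <= library_serial <= 0xFFFF and 0 <= symbol_index <= 0xFFFF):
        raise ValueError("both fields must fit in 16 bits")
    return struct.pack(">HH", library_serial, symbol_index)

def unpack_symbol(code):
    """Split a 4-byte symbol code back into (library_serial, symbol_index)."""
    return struct.unpack(">HH", code)
```

. eg, pack_symbol(0x0102, 0x0304) gives the bytes 01 02 03 04,
and unpack_symbol turns them back into the pair .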

adda/wordcode/styles of code paging:

. code paging is a system for saving space
by using shorter codes for frequently used words .
. ascii is byte-sized, but its sign.bit is always 0 .
. there are 2 ways to signal a larger size:

# 2 bits as discriminants:
0x -> 1 byte: 7-bit space,
10 -> 2 bytes: 14-bit space
11 -> 4 bytes: 30-bit space .
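
. the 2-bit discriminant scheme could be sketched like this,
assuming big-endian byte order and that each value
is written in the shortest size that holds it
(encode_code and decode_code are illustrative names):

```python
def encode_code(value):
    """Encode a code number using the leading bits as a size discriminant."""
    if value < 0:
        raise ValueError("codes are non-negative")
    if value < 1 << 7:
        return bytes([value])                            # 0xxxxxxx
    if value < 1 << 14:
        return (0x8000 | value).to_bytes(2, "big")       # 10 + 14 bits
    if value < 1 << 30:
        return (0xC0000000 | value).to_bytes(4, "big")   # 11 + 30 bits
    raise ValueError("code does not fit in 30 bits")

def decode_code(data, pos=0):
    """Decode one code starting at pos; return (value, next_pos)."""
    first = data[pos]
    if first & 0x80 == 0:          # leading bit 0: 1 byte
        return first, pos + 1
    if first & 0xC0 == 0x80:       # leading bits 10: 2 bytes
        return int.from_bytes(data[pos:pos + 2], "big") & 0x3FFF, pos + 2
    # leading bits 11: 4 bytes
    return int.from_bytes(data[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4
```

. plain ascii passes through unchanged,
since every ascii byte already has a leading 0 bit .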

# sign.bit discriminant:
. sign.bit =1 means word is 2 bytes
so then word space is 15 bits .
. sign.bit=0 means 1 byte,
but reuse the control codes as discriminants:
most serve no addx purpose except
( 9: tab, 10: newline, 13: carriage return )
all the others -- ( 0..31 -{9,10, 13} ) --
could be indicating the code pages that are
using codes of size 2, 3, or 4 bytes .
. null could be the escape,
it would mean the next byte is the size field,
and then the next string of that size
should be taken as ascii -- even the control codes .
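
. a rough sketch of a reader for the sign.bit scheme;
the notes don't say how a code-page selector
decides the size of what follows it,
so this sketch only reports the selector byte and moves on
(the token names and big-endian order are assumptions):

```python
PLAIN_CONTROLS = {9, 10, 13}   # tab, newline, carriage return

def tokenize(data):
    """Split a byte stream into (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(data):
        b = data[pos]
        if b & 0x80:
            # sign.bit = 1: a 2-byte word with a 15-bit code
            if pos + 2 > len(data):
                raise ValueError("truncated 2-byte word")
            code = int.from_bytes(data[pos:pos + 2], "big") & 0x7FFF
            tokens.append(("word", code))
            pos += 2
        elif b == 0:
            # null escape: next byte is a size, then that many raw bytes
            if pos + 1 >= len(data):
                raise ValueError("escape without a size byte")
            size = data[pos + 1]
            raw = data[pos + 2:pos + 2 + size]
            if len(raw) != size:
                raise ValueError("truncated escaped string")
            tokens.append(("raw", raw))
            pos += 2 + size
        elif b < 32 and b not in PLAIN_CONTROLS:
            # a reused control code: selects a code page (handling left open)
            tokens.append(("page", b))
            pos += 1
        else:
            tokens.append(("ascii", chr(b)))
            pos += 1
    return tokens
```

. inside an escaped string, control codes are kept as raw bytes
and are not treated as page selectors .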

pages for strings:
. that same string idea could apply to all code pages;
code pages would come in twins,
one for a single char, and one for a string
where the 2nd byte of the string is the size,
coded like so:
0: null terminated
1 ... 127: actual size in bytes
128..255: size is {128, 256, 512, ... 2^134 } .
. you would combine these large strings
just as the number system combines large digits .
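
. a sketch of the size byte, reading codes 128..255
as the powers of two 2^7 .. 2^134 (code - 121 is the exponent);
the greedy splitting in chunk_size_bytes is only one guess
at what "combining large strings" might look like,
and both function names are made up for this example:

```python
def decode_size(size_byte):
    """Return the string size in bytes, or None for null-terminated."""
    if not 0 <= size_byte <= 255:
        raise ValueError("size field is one byte")
    if size_byte == 0:
        return None                      # 0: null terminated
    if size_byte <= 127:
        return size_byte                 # 1..127: actual size
    return 1 << (size_byte - 121)        # 128 -> 2^7, ..., 255 -> 2^134

def chunk_size_bytes(length):
    """Split a length into size bytes, largest chunk first.

    Assumed scheme: power-of-two chunks while length >= 128,
    then one final chunk of 1..127 bytes if anything is left.
    """
    sizes = []
    while length >= 128:
        exponent = min(length.bit_length() - 1, 134)
        sizes.append(121 + exponent)
        length -= 1 << exponent
    if length > 0:
        sizes.append(length)
    return sizes
```

. eg, a 1000-byte string would be split into chunks of
512, 256, 128, and 104 bytes .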

. one reason unicode did it their way
(where null was reserved for end-of-string)
is that it was preferred by unix/c coders
because it saves space if using only ascii
(no matter how long your strings are,
the size field need be only one byte wide
-- the size of your terminating null character).
. but that takes a huge chunk of 4-byte space;
considering the high-order bytes can't be zero .
. so, when do you need to communicate with unix/c?
the only time is when it not only expects no nulls,
but also expects all ascii,
so just convert to ascii at those times;
for other communications, use sockets or temp files .

2.13: prefixes and suffixes:

. some words in the 3 or 4 byte range
had a base code for finding the base word
then the extension code would indicate
the form of the word .
. there is some main combination of prefixes and suffixes,
and then one code page can be adding the prefix word;
ie, stating which prefix combinations are being used .
. there might need to be 2 versions:
one is giving the main combinations,
(eg, -ing-ly is one combination)
the other version, for uncommon combinations,
is using word strings, eg:
anti.dis.establishment.arianism
might be 4 words .

the [all one word] codepage:

. one string code is for all one word:
if (understand) was not a common word,
it would be represented as (under, stand)
this would be independent of code page:
it is saying print the next 2 words without a space .
. another way is to use the backspace char:
every word implies adding a space afterward,
but if there's a backspace char, then it connects words
instead of separating them .
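
. the backspace idea could be sketched like so,
assuming the words have already been decoded to text
and the backspace arrives as its own token
(render_words is a hypothetical helper):

```python
BACKSPACE = "\b"

def render_words(tokens):
    """Join words with implied spaces; a backspace token cancels one space."""
    out = []
    for token in tokens:
        if token == BACKSPACE:
            # remove the space implied by the previous word
            if out and out[-1] == " ":
                out.pop()
        else:
            out.append(token)
            out.append(" ")      # every word implies a following space
    return "".join(out).rstrip(" ")
```

. so the tokens (under, backspace, stand, it)
would print as "understand it" .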

. yet another more generic function is to
have a container with 2 params:
it shows how many of the following words are contained,
that may be up to 5 bits (32 word max)
and it uses 3-bits for indicating
a style for the arrangement of the words:
{ no spaces between words
, nonbreaking spaces, dashes, underscores
, camelcasing, all initials capitalized
, underline
}.
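
. a sketch of the container packed into one byte,
assuming the 5-bit field stores (count - 1) so that 1..32 words fit,
and assuming the style numbers follow the order of the list above;
the underline style is a display attribute,
so this sketch packs it but does not render it:

```python
STYLES = ["nospace", "nbsp", "dash", "underscore",
          "camel", "initial_caps", "underline"]   # assumed numbering 0..6

def pack_container(count, style):
    """Pack a word count (1..32) and a style name into one byte."""
    if not 1 <= count <= 32:
        raise ValueError("count must be 1..32")
    return ((count - 1) << 3) | STYLES.index(style)

def unpack_container(byte):
    """Return (count, style) from a container byte."""
    style_code = byte & 0b111
    if style_code >= len(STYLES):
        raise ValueError("unassigned style code")
    return (byte >> 3) + 1, STYLES[style_code]

def render(words, style):
    """Arrange the contained words according to the style."""
    if style == "nospace":
        return "".join(words)
    if style == "nbsp":
        return "\u00a0".join(words)
    if style == "dash":
        return "-".join(words)
    if style == "underscore":
        return "_".join(words)
    if style == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if style == "initial_caps":
        return " ".join(w.capitalize() for w in words)
    raise ValueError("style not rendered in this sketch: " + style)
```

. eg, render(["under", "stand"], "nospace") gives "understand",
which covers the [all one word] case as one style among several .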

the hash table:
. to build the word code array,
look at all text in the library;
hash each word and place in hash table;
the hash table entry has a record of stats for the word:
how it is spelled (for the sake of collisions)
and how many times it was found .
. then find the 32768 most popular words
and assign them to the shorter, 2-byte codes .
. make a new hash table of (spelling, wordcode)
for turning words into wordcodes .
. make an array 0..32767 of (spelling, pre/postfix spellings)
for turning wordcodes into words .
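
. the steps above can be sketched with python's built-in hash tables
(Counter and dict keep the spelling as the key,
which is what resolves collisions);
the word-splitting pattern is an assumption,
and the array here holds only the spelling,
leaving out the pre/postfix spellings:

```python
from collections import Counter
import re

def build_tables(texts, max_codes=32768):
    """Count words in all texts and give the most popular ones the low codes.

    Returns (word_to_code, code_to_word).
    """
    counts = Counter()
    for text in texts:
        # assumed tokenizer: runs of letters and apostrophes, lowercased
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # most frequent first; ties broken alphabetically to stay deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_codes]
    code_to_word = [word for word, _ in ranked]
    word_to_code = {word: code for code, word in enumerate(code_to_word)}
    return word_to_code, code_to_word
```

. words that miss the cut get no entry in either table,
and would have to fall back to a longer code or a spelled-out string .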

1 comment:

  1. 8.18: adde/wordcode
    /combined global and local dictionaries:
    . the way to keep down the dictionary size
    is to have a small global dictionary of
    very common terms
    then have a custom dictionary in file,
    which is a list of pointers into a string heap;
    the file contains all codes; some are from global
    and the rest are from the custom dictionary .
    this makes it easy to generate keywords per file
    and to search keywords
    as well as to condense the file size .
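
    . a small sketch of the combined scheme,
    assuming the global codes come first and local codes follow them,
    and that heap strings are null-terminated ascii;
    GLOBAL_WORDS is a made-up stand-in for the real global dictionary:

```python
GLOBAL_WORDS = ["the", "a", "of", "and"]            # hypothetical global dictionary
GLOBAL_CODES = {word: code for code, word in enumerate(GLOBAL_WORDS)}

def encode_file(words):
    """Return (codes, offsets, heap) for one file.

    offsets is the custom dictionary: pointers into the string heap.
    """
    heap = bytearray()
    offsets = []
    local_codes = {}
    codes = []
    for word in words:
        if word in GLOBAL_CODES:
            codes.append(GLOBAL_CODES[word])
            continue
        if word not in local_codes:
            # new local word: its code continues after the global range
            local_codes[word] = len(GLOBAL_WORDS) + len(offsets)
            offsets.append(len(heap))
            heap += word.encode("ascii") + b"\0"
        codes.append(local_codes[word])
    return codes, offsets, bytes(heap)

def lookup(code, offsets, heap):
    """Turn a code back into its word using the global or custom dictionary."""
    if code < len(GLOBAL_WORDS):
        return GLOBAL_WORDS[code]
    start = offsets[code - len(GLOBAL_WORDS)]
    end = heap.index(b"\0", start)
    return heap[start:end].decode("ascii")
```

    . the string heap then doubles as the file's keyword list,
    since it holds exactly the words that are not in the global dictionary .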
