3.6: adda/wordcode/including grammar trees:
. not only can we encode all words,
but also attach etrees (grammar trees) to the words .
. we should reserve some ascii control codes for this,
since that will be common .
. relative addressing can reduce pointer size,
ie, the pointers can be byte sized
if we know that their base address
(the address they are relative to)
is no farther than 256 units from the
farthest place the pointer can indicate .
. thus to be byte sized,
we need to have a separate etree for every phrase
(delimited by semicolon or period)
and we need to know that the size of phrases
will be less than 128 words
(there will be as many pointers as words,
and the first node may have to point into
the middle of the phrase's text,
so it has to jump past all the pointers,
plus half the text).
[3.7:
. we can give the pointers more range by
making them relative to where the text begins .
. we could have a code that says
"( this is the beginning of an etree;
what follows is a link to the beginning of the text;
if the first byte of this link is a negative number,
then there's only one byte,
and its absolute value is showing
how many bytes ahead the text is;
it also says the etree's pointers are byte-sized .
. if the first byte is positive,
then the link is composed of 2 bytes,
showing you how far ahead the text is;
this also says the etree's pointers are 2-byte sized )" .]
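. the header rule above can be sketched in code;
this is a minimal reading of the note, not a settled format,
and the start-code value and function name are assumptions:

```python
ETREE_START = 0x01  # hypothetical control code marking an etree

def read_etree_header(buf, i):
    """Return (text_offset, pointer_width, next_index) for the
    etree whose start code is at buf[i]."""
    assert buf[i] == ETREE_START
    first = buf[i + 1]
    if first >= 128:                 # negative as a signed byte
        offset = 256 - first         # absolute value: bytes ahead to the text
        return offset, 1, i + 2      # byte-sized pointers follow
    else:                            # positive: a 2-byte link
        offset = (first << 8) | buf[i + 2]
        return offset, 2, i + 3      # 2-byte pointers follow

# example: start code, then -5 (0xFB): the text begins 5 bytes ahead
print(read_etree_header(bytes([ETREE_START, 0xFB]), 0))   # (5, 1, 2)
```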
. you can tell what a node's pointer is pointing at
by looking at the 1st byte of what it's pointing at:
if it's not the code for a node, then it's text;
therefore a node needs to have 3 parts:
a 1-byte node code, and 2 pointers .
. english syntax trees can have long tree nodes;
eg, (if * then * else * ) = 6 pointers in one node;
but, generally, all syntax trees can be reduced to
a sequence of minimal nodes (2 pointers);
so one node code could mean
it's the non-end of a sequence,
and the other node code could mean
it's the end of a node sequence .
. but there is a more compact way:
we could have just one node code
followed by a 1-byte length field,
and that tells us how many pointers follow .
. or
if we have more codes to spare,
then we could have 5 node codes:
#1: 2 pointers (eg, the * )
#2: 3 pointers (eg, * unless * )
#3: 4 pointers (eg, if * then * )
#4: 6 pointers (eg, if * then * else * )
#5: n-pointers -- the generic case:
it means the number of pointers is in the following byte .
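. the 5-node-code idea above could look like this
(the code values are invented for illustration,
and pointers are taken as byte-sized):

```python
# fixed node codes for the common arities, plus a generic code
# whose arity lives in the following length byte
NODE2, NODE3, NODE4, NODE6, NODEN = 0x02, 0x03, 0x04, 0x06, 0x07

FIXED = {2: NODE2, 3: NODE3, 4: NODE4, 6: NODE6}

def encode_node(pointers):
    """Encode a node as: code [length] pointer-bytes."""
    n = len(pointers)
    if n in FIXED:
        return bytes([FIXED[n]]) + bytes(pointers)
    return bytes([NODEN, n]) + bytes(pointers)

print(encode_node([10, 20]).hex())          # '020a14'
print(encode_node([1, 2, 3, 4, 5]).hex())   # '07050102030405'
```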
2012-03-31
2012-02-28
wordcoding and code paging
2.12: adda/wordcode/
code page for programming symbols:
. some places where cocoa uses strings
it means to use symbols,
a 4byte number would do it:
each library has a 16-bit serial number;
it may contain a 16-bit number of symbols .
. english words will be encoded too,
so maybe both could be part of the same coding system?
or be modular, part of its own code page .
. the programming symbols can be used alone
or the english words can embed it by saying
this is jargon from computing:
the next 4bytes belong to it .
adda/wordcode/styles of code paging:
. code paging is a system for saving space
by using shorter codes for frequently used words .
. ascii is byte-sized, but the sign.bit = 0 .
. there are 2 ways to signal a larger size:
# 2 bits as discriminants:
0x -> 1 byte: 7-bit space,
10 -> 2 bytes: 14-bit space
11 -> 4 bytes: 30-bit space .
# sign.bit discriminant:
. sign.bit =1 means word is 2 bytes
so then word space is 15 bits .
. sign.bit=0 means 1 byte,
but reuse the control codes as discriminants:
most serve no addx purpose except
( 9: tab, 10: newline, 13: carriage return )
all the others -- ( 0..31 -{9,10, 13} ) --
could be indicating the code pages that are
using codes of size 2, 3, or 4 bytes .
. null could be the escape,
it would mean the next byte is the size field,
and then the next string of that size
should be taken as ascii -- even the control codes .
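. the sign-bit scheme above, decoded as a sketch
(the page and escape behaviors are simplified assumptions,
and the token shapes are invented for illustration):

```python
KEEP = {9, 10, 13}   # tab, newline, carriage return stay as ascii

def tokens(buf):
    """Yield ('ascii', ch) / ('word', code) / ('page', n) /
    ('raw', bytes) tokens from a wordcoded byte string."""
    i = 0
    while i < len(buf):
        b = buf[i]
        if b >= 128:                   # sign.bit = 1: a 2-byte word code
            yield ('word', ((b & 0x7F) << 8) | buf[i + 1])
            i += 2
        elif b >= 32 or b in KEEP:     # plain ascii
            yield ('ascii', chr(b))
            i += 1
        elif b == 0:                   # null escape: size byte, then raw ascii
            size = buf[i + 1]
            yield ('raw', buf[i + 2:i + 2 + size])
            i += 2 + size
        else:                          # other control codes select a code page
            yield ('page', b)
            i += 1

print(list(tokens(b'hi')))                 # [('ascii', 'h'), ('ascii', 'i')]
print(list(tokens(bytes([0x80, 0x05]))))   # [('word', 5)]
```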
pages for strings:
. that same string idea could apply to all code pages;
code pages would come in twins,
one for single char, and one is a string
where the 2nd byte of the string is the size,
coded like so:
0: null terminated
1 ... 127: actual size in bytes
128..255: size is {128, 256, 512, ... 2^134 } .
. you would combine these large strings
just as the number system combines large digits .
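. the size byte coding above, as a sketch
(decode_size is an assumed helper name, not an existing API):

```python
def decode_size(b):
    """Map a string's size byte to a byte count
    (None means null-terminated)."""
    if b == 0:
        return None              # null terminated
    if b <= 127:
        return b                 # literal size in bytes
    return 2 ** (b - 121)        # 128 -> 2^7, 129 -> 2^8, ... 255 -> 2^134

print(decode_size(100))   # 100
print(decode_size(128))   # 128
print(decode_size(130))   # 512
```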
. one reason unicode did it their way
(where null was reserved for end-of-string)
is that it was preferred by unix/c coders
because it saves space if using only ascii
(no matter how long your strings are,
the size field need be only one byte wide
-- the size of your terminating null character).
. but that wastes a huge chunk of the 4-byte space,
considering the high-order bytes can't be zero .
. so, when do you need to communicate with unix/c?
the only time is when it not only expects no nulls,
but also expects all ascii,
so just convert to ascii at those times;
for other communications, use sockets or temp files .
2.13: prefixes and suffixes:
. some words in the 3- or 4-byte range
have a base code for finding the base word;
then the extension code would indicate
the form of the word .
. there is some main combination of prefixes and suffixes,
and then one code page can be adding the prefix word;
ie, stating which prefix combinations are being used .
. there might need to be 2 versions:
one is giving the main combinations,
(eg, -ing-ly is one combination)
the other version, for uncommon combinations,
is using word strings, eg:
anti.dis.establishment.arianism
might be 4 words .
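. the base-code + extension-code idea could be sketched like so
(the code values and the suffix table are invented for illustration):

```python
# extension codes name a prefix/suffix combination;
# base codes name a base word
SUFFIX_COMBOS = {0: '', 1: 'ing', 2: 'ly', 3: 'ingly'}
BASE_WORDS = {100: 'quick', 101: 'walk'}

def decode_extended(base_code, ext_code):
    """Join a base word with the suffix combination
    the extension code names."""
    return BASE_WORDS[base_code] + SUFFIX_COMBOS[ext_code]

print(decode_extended(100, 2))   # 'quickly'
print(decode_extended(101, 1))   # 'walking'
```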
the [all one word] codepage:
. one string code is for all one word:
if (understand) was not a common word,
it would be represented as (under, stand);
this would be independent of code page:
it is saying print the next 2 words without a space .
. another way is to use the backspace char:
every word implies adding a space afterward,
but if there's a backspace char, then it connects words
instead of separating them .
. yet another more generic function is to
have a container with 2 params:
it shows how many of the following words are contained,
that may be up to 5 bits (32 word max)
and it uses 3-bits for indicating
a style for the arrangement of the words:
{ no spaces between words
, nonbreaking spaces, dashes, underscores
, camelcasing, all initials capitalized
, underline
}.
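. the 2-param container above fits in a single header byte:
5 bits of word count, 3 bits of joining style
(the style numbering here is an assumed ordering):

```python
STYLES = ['nospace', 'nbsp', 'dash', 'underscore',
          'camel', 'initialcaps', 'underline']

def pack_container(count, style):
    """Pack (word count 1..32, style 0..7) into one byte."""
    assert 1 <= count <= 32 and style < 8
    return ((count - 1) << 3) | style

def unpack_container(b):
    return (b >> 3) + 1, b & 0x07

b = pack_container(2, STYLES.index('camel'))
print(unpack_container(b))     # (2, 4)
```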
the hash table:
. to build the word code array,
look at all text in the library;
hash each word and place in hash table;
the hash table entry has a record of stats for the word:
how it is spelled (for the sake of collisions)
and how many times it was found .
. then find the 32768 most popular words
and assign them to the shorter, 2-byte codes .
. make a new hash table of (spelling, wordcode)
for turning words into wordcodes .
. make an array 0..32767 of (spelling, pre/postfix spellings)
for turning wordcodes into words .
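. the build above, sketched with Python's Counter standing in for
the hash table of (spelling, count) stats
(the function name is an assumption):

```python
from collections import Counter

def build_wordcode_tables(texts):
    """Count every word in the library, then give the 32768 most
    popular words the short 2-byte codes."""
    counts = Counter(w for t in texts for w in t.split())
    top = [w for w, _ in counts.most_common(32768)]
    encode = {w: i for i, w in enumerate(top)}   # spelling -> wordcode
    decode = top                                 # wordcode -> spelling
    return encode, decode

enc, dec = build_wordcode_tables(["the cat sat on the mat", "the dog"])
print(dec[enc["the"]])    # 'the'
```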
2012-01-31
which scripts are in what-size utf-8 codes
adde/utf-8/which codes are in how many bytes?
1.5: todo: [done]
. in utf-8, the number of bytes per code is incremental:
ascii uses 1 byte;
then how is it divided between 2..4 bytes
along what code pages? [1.31:
11-bits in 2 bytes, -- most euro or compact langs
16-bits in 3 bytes -- math & most other lang's
21-bits in 4 bytes --
In November 2003 UTF-8 was restricted to U+10FFFF,
in order to match the constraints of the
UTF-16 character encoding.
This removed the 5 and 6-byte sequences,
and about half the 4-byte sequences.]
1.6: web:
Bits Last code point Byte 1 Byte 2..4
7 7F 0xxxxxxx
11 07FF 110xxxxx 10xxxxxx
--. Latin letters with diacritics and characters from the
-- Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets
16 FFFF 1110xxxx 10xxxxxx *2
--. the rest of the Basic Multilingual Plane
-- (which contains virtually all characters in common use).
21 10,FFFF 11110xxx 10xxxxxx *3
--. less common CJK characters and various historic scripts.
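. the table above reduces to a small rule for how many bytes
utf-8 needs per code point (utf8_len is an assumed helper name):

```python
def utf8_len(cp):
    """Byte count of a code point in utf-8."""
    if cp <= 0x7F: return 1
    if cp <= 0x7FF: return 2
    if cp <= 0xFFFF: return 3
    return 4                  # up to U+10FFFF after the 2003 restriction

print(utf8_len(ord('A')))     # 1
print(utf8_len(0x2200))       # 3  (the "for all" math operator)
```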
1.31: web: where are the math symbols? in 3-bytes:
Mathematical Operators (U+2200–U+22FF)
Miscellaneous Mathematical Symbols-A (U+27C0–U+27EF)
Miscellaneous Mathematical Symbols-B (U+2980–U+29FF)
Supplemental Mathematical Operators (U+2A00–U+2AFF)
Letterlike Symbols (U+2100–U+214F)
Ceilings and floors in Miscellaneous technical (U+2308–U+230B)
Geometric Shapes (U+25A0–U+25FF)
Arrows in Miscellaneous Symbols and Arrows (U+2B30–U+2B4C)
Mathematical Alphanumeric Symbols (1D400–1D7FF)
2009-12-28
wordcode
9.27: todo.adde/wordcode/decomposing word`parts:
. teasing words apart could get done pretty fast,
once you have a list of basic parts,
and then have a list of
all words in your db that have those parts .
. it lists the words, you study them,
and it makes it easy for you to catch the exceptions
or find new patterns in words for more efficient coding .
10.22: adda/unix/tools communicating with binary pipes:
. the unix way
is to have tools communicating with text pipes,
whereas, the goal of adda
is to have a comm'standard that's binary;
. unix is the primary target platform;
so, I'm wondering how to efficiently pack binary
into unix text strings
(where there can be no null's; ie, no bytes = 00) .
sockets:
. use of string may be a requirement of
tool communications within a std unix shell;
but, for connecting tools within your own shell,
unix sockets can provide binary app-to-app pipes .
. that way,
you can have your app's talk to each other in binary
while exports to others can be done by
translating your binary to their {unicode, xml, ...} .
10.22: adda/unix/wrapping binary files in text:
. the new std is to use unicode,
and these values
can be reused for a binary std's wordcodes
(similar to the way chinese text
has a separate character for each word) .
. a more efficient way
is to think of each byte as being one digit of a number
(there are 255 non-zero values in a byte) .
. if practicality requires your number be in base 2**n
then a byte can support a number system of base 128:
(having 1..127 map to themselves,
and zero quickly flipped to FF#16 (-1) )
. that still leaves each byte's other negative values
to mean something else;
eg, when finding a negative byte,
get the binary complement;
and if not 0,
then have the byte represent n+1 consecutive zeroes .
[10.28:
. or more likely,
they could be reserved for indicating
the type or length of the next digit sequence;
eg, then your number stream could be variable-length
like unicode,
except it could have string descriptors,
where a negative would say that until the next descriptor,
the default number length would be 4bytes instead of 1 .
(unlike unix where everything was byte-based,
this would be word-based,
so upon reading the next element of a file,
it uses these descriptors to find complete elements)
] .
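. the null-free base-128 digit idea above, sketched;
the run coding for consecutive zeroes is my reading of the note
(a negative byte's complement n, when nonzero, stands for n+1 zeroes),
not a settled format:

```python
def pack_digits(digits):
    """Encode base-128 digits (0..127) with no 0x00 bytes:
    nonzero digits pass through; a lone zero flips to 0xFF (-1);
    a run of n zeroes becomes one negative byte whose binary
    complement is n-1."""
    out = bytearray()
    i = 0
    while i < len(digits):
        if digits[i] == 0:
            n = 1
            while i + n < len(digits) and digits[i + n] == 0 and n < 127:
                n += 1
            if n == 1:
                out.append(0xFF)             # lone zero -> -1
            else:
                out.append(0xFF ^ (n - 1))   # complement of n-1: n zeroes
            i += n
        else:
            out.append(digits[i])
            i += 1
    return bytes(out)

print(pack_digits([5, 0, 9]).hex())     # '05ff09'
print(0 in pack_digits([0, 0, 0, 7]))   # False
```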
11.9: engl/word.ules:
A lexeme is an abstract unit of morphological analysis in linguistics,
that roughly corresponds to a set of forms taken by a single word.
For example, in the English language,
run, runs, ran and running are forms of the same lexeme, run .
Lexemes are often composed of smaller units
with individual meaning called morphemes .