2009-12-28

wordcode

9.27: todo.adde/wordcode/decomposing word`parts:
. teasing words apart could get done pretty fast,
once you have a list of basic parts,
and then have a list of
all words in your db that have those parts .
. it lists the words, you study them,
and it makes it easy for you to catch the exceptions
or find new patterns in words for more efficient coding .
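
. a minimal sketch of that listing step, in python;
the names (index_by_part, unmatched) and the sample words
are made up for illustration,
and a plain substring test stands in for
whatever matching a real word db would use:

```python
from collections import defaultdict


def index_by_part(words, parts):
    """Map each basic part to the words that contain it (substring match)."""
    index = defaultdict(list)
    for word in words:
        for part in parts:
            if part in word:
                index[part].append(word)
    return dict(index)


def unmatched(words, parts):
    """Words containing none of the known parts: candidates for new parts."""
    return [w for w in words if not any(p in w for p in parts)]


# hypothetical sample data, not from any real word db
sample_words = ["running", "unhappy", "happiness", "cat"]
sample_parts = ["run", "ness", "un"]

for part, hits in index_by_part(sample_words, sample_parts).items():
    print(part, "->", hits)
print("no known part:", unmatched(sample_words, sample_parts))
```

. the listing for "un" includes "running",
which is the kind of exception
that studying the list is meant to expose .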


10.22: adda/unix/tools communicating with binary pipes:
. the unix way
is to have tools communicating with text pipes,
whereas, the goal of adda
is to have a comm'standard that's binary;
. unix is the primary target platform;
so, I'm wondering how to efficiently pack binary
into unix text strings
(where there can be no null's; ie, no bytes = 00) .
sockets:
. use of strings may be a requirement of
tool communications within a std unix shell;
but, for connecting tools within your own shell,
unix sockets can provide binary app-to-app pipes .
. that way,
you can have your app's talk to each other in binary
while exports to others can be done by
translating your binary to their {unicode, xml, ...} .
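
. a minimal sketch of two endpoints
exchanging binary over a unix socket pair, in python;
the 4-byte length prefix is an assumed framing convention,
not something adda defines,
and the helper names are made up for illustration:

```python
import socket
import struct


def send_record(sock, payload):
    """Send one binary record, prefixed by its length (assumed framing)."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the record was complete")
        buf.extend(chunk)
    return bytes(buf)


def recv_record(sock):
    """Receive one length-prefixed binary record."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo():
    """Pass a payload containing zero bytes between two connected sockets."""
    a, b = socket.socketpair()
    try:
        send_record(a, b"\x00\x01\x00\xff")
        return recv_record(b)
    finally:
        a.close()
        b.close()


print(demo())
```

. the payload has zero bytes in it,
which a socket carries unchanged;
in real use the two ends would be separate app's
rather than one process talking to itself .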

10.22: adda/unix/wrapping binary files in text:
. the new std is to use unicode,
and these values
can be reused for a binary std's wordcodes
(similar to the way chinese text
has a separate character for each word) .
. a more efficient way
is to think of each byte as being one digit of a number
(there are 255 non-zero values in a byte) .
. if practicality requires your number be in base 2**n
then a byte can support a number system of base 128:
(having 1..127 map to the same,
and zero is quickly flipped to be FF#16 (-1) )
. that still leaves each byte's other negative values
to mean something else;
eg, when finding a negative byte,
get the binary complement;
and if not 0,
then have the byte represent n+1 consecutive zeroes .
[10.28:
. or more likely,
they could be reserved for indicating
the type or length of the next digit sequence;
eg, then your number stream could be variable-length
like unicode,
except it could have string descriptors,
where a negative would say that until the next descriptor,
the default number length would be 4bytes instead of 1 .
(unlike unix where everything was byte-based,
this would be word-based,
so upon reading the next element of a file,
it uses these descriptors to find complete elements)
] .
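
. a minimal sketch of the base-128 idea, in python,
following the zero-run reading
(not the 10.28 descriptor variant);
it assumes digits are stored most significant first,
and the function names are made up for illustration:

```python
def to_digits(n):
    """Base-128 digits of a non-negative integer, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n:
        digits.append(n & 0x7F)
        n >>= 7
    return digits[::-1]


def from_digits(digits):
    """Inverse of to_digits."""
    n = 0
    for d in digits:
        n = n * 128 + d
    return n


def pack_digits(digits):
    """Encode base-128 digits as bytes that never contain 0x00.

    Digits 1..127 map to themselves; a single zero becomes 0xFF;
    a run of n+1 zeroes (1 <= n <= 127) becomes the complement of n.
    """
    out = bytearray()
    i = 0
    while i < len(digits):
        d = digits[i]
        if d != 0:
            out.append(d)
            i += 1
            continue
        run = 1
        while i + run < len(digits) and digits[i + run] == 0 and run < 128:
            run += 1
        out.append(0xFF - (run - 1))  # run of 1 -> 0xFF, run of 2 -> 0xFE, ...
        i += run
    return bytes(out)


def unpack_digits(data):
    """Decode bytes produced by pack_digits back into base-128 digits."""
    digits = []
    for b in data:
        if b == 0:
            raise ValueError("null byte is not allowed in this encoding")
        if b < 0x80:
            digits.append(b)
        else:
            n = 0xFF - b  # binary complement of the "negative" byte
            digits.extend([0] * (n + 1))
    return digits


value = 5 * 128**4 + 3  # base-128 digits: 5, 0, 0, 0, 3
packed = pack_digits(to_digits(value))
print(packed.hex(), from_digits(unpack_digits(packed)) == value)
```

. here the three zero digits collapse into the one byte FD#16,
so the string stays free of null's
and gets shorter wherever zeroes cluster .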

11.9: engl/word.ules:
A lexeme is an abstract unit of morphological analysis in linguistics,
that roughly corresponds to a set of forms taken by a single word.
For example, in the English language,
run, runs, ran and running are forms of the same lexeme, run .
Lexemes are often composed of smaller units
with individual meaning called morphemes .
