2013-12-11

efficient Levenshtein distance

10.15: news.adds/algorithm/efficient Levenshtein distance:
. the Levenshtein distance between two words
is the minimum number of single-character edits
(insertion, deletion, substitution)
required to change one word into the other.
. finding the Lev' distance takes a long time,
on the order of (length of longer string
times length of shorter string),
so we are looking for ways to avoid it .

Leo Polovets, a quora.com participant,
came up with an approach that computed
a character distribution histogram .

. given a string, see if it is within
a given Levenshtein distance
to any string in a dictionary.

. view a 64-bit integer as
16 nibbles (sets of 4 bits),
and assign characters to each nibble
by associating a given character
with the (character mod 16)th nibble .
. then in each nibble of a string's histogram
have a count of each time
one of that nibble's assigned characters
was used by the input string .

For example, in the string "hello",
the ASCII values of characters are
104, 101, 108, 108, 111, which modulo 16 are
8, 5, 12, 12, 15, so there is
a count of 2 in the 12th nibble;
because, 2 of the characters mod 16 = 12;
then for similar reasons,
there's a 1 in the 8th 5th and 15th nibbles,
and zeroes in every other nibble .

. with every string boiled down to
a single 64-bit histogram,
use bit arithmetic to quickly determine
how much 2 words differ in their use of
each of the mod-16 classes of characters .
. if there are many differences in
these mod-16 character classes
there is an even greater edit distance;
so, if we are looking for
an edit distance of less than 2
as a spell checker might do,
then we can rule out histogram differences
of greater than 2,
and computing the full Levenshtein distance
can be avoided most of the time .

No comments:

Post a Comment