Data Searching using Levenshtein
by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Levenshtein.
Levenshtein distance is a metric for measuring the amount of difference between
two sequences (i.e., the so called edit distance). The Levenshtein distance
between two strings is given by the minimum number of operations needed to
transform one string into the other, where an operation is an insertion, deletion,
or substitution of a single character.
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Levenshtein(const String1,String2: string): extended; // similarity engine
var
{$IFDEF VER100} // Delphi 3 can only do 4kbx4kb array - does not support dynamic arrays
Matrix:Array[0..4095] of Array[0..4095] of Integer;
{$ELSE}
Matrix:Array of Array of Integer;
{$ENDIF}
maxl,i,j,n,m,distanz,cost:Integer;
s_i,t_j:char;
begin
Result:=0.0;
If (String1='') or (String2='') then Exit;
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, String1; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV N, EAX; // Put into Result
MOV EAX, String2; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV M, EAX; // Put into Result
End;
{$ELSE}
N:=Length(String1);
M:=Length(String2);
{$ENDIF}
{$IFNDEF VER100}
setlength(Matrix,n+1,m+1);
{$ENDIF}
if n=0 then distanz:=m
else begin
if m=0 then distanz:=n
else begin
for i:=0 to n do Matrix[i,0]:=i;
for j:=0 to m do Matrix[0,j]:=j;
for i:=1 to n do begin
s_i:=String1[i];
for j:=1 to m do begin
t_j:=String2[j];
if s_i=t_j then cost:=0
else cost:=1;
Matrix[i,j]:=Min(Min(Matrix[i-1,j]+1,Matrix[i,j-1]+1),Matrix[i-1,j-1]+cost);
end;
end;
distanz:=Matrix[n,m];
end;
end;
if m>n then maxl:=m
else maxl:=n;
result:=1-(distanz/maxl);
if result<0 then result:=0
else Result:=Result*100;
end;
Using this algorithm, both "Robert" and "Rupert" return "66.66%" the same,
while "Rubin" and "Robin" yields "80%". "Ashcroft" and "Ashcraft" yields "87.5%".
However, "George" and "Jorge" yields "66.66%", and "Dave" and "David"
yields "60%" the same.
In this sample, we see another way of veiwing the differences between words. Things to
pay close attention to is the result for Ashcroft and Ashcraft, even though close to us,
this algorithm really didn't consider them very closer. Also, note that George and Jorge
return the same result as Robert and Rupert. However, Dave and David are still out there.
Please continue reading the other algorithms I use in my search engines.
G.E. Ozz Nixon Jr.