Data Searching using Soundex
by: G.E. Ozz Nixon Jr.
Published: September 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Soundex.
Soundex is a phonetic algorithm for indexing words by sound as pronounced in
American English. The goal of course was to encode homophones to the same
representation so they could be matched despite minor differences in spelling.
Unfortunately Soundex is probably the most used algorithm used in data
searching and is often used incorrectly. Soundex is quite simple:
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function SoundEx(Const S: string): string;
const
Table: array[1..26] of Char = '.123.12..22455.12623.1.2.2';
var
SoundString: string[255];
I1: Integer;
I2: Integer;
isNum: boolean;
Ch: Char;
Str:String;
MaxLoop:Integer;
begin
Result := S;
if S = '' then Exit;
Str:=S;
isNum := true;
repeat
Ch := UpCase(S[1]);
if Ch > #64 then isNum := false
else Delete(Str, 1, 1);
until (isNum = false) or (Str = '');
Result := Str;
if Str = '' then Exit;
SoundString[0] := #255;
FillChar2(SoundString[1], 255, '0');
// Step 1: ASCII to Soundex
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, S; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV MaxLoop, EAX; // Put into Result
End;
{$ELSE}
MaxLoop:=Length(S);
{$ENDIF}
for I1 := 1 to MaxLoop - 1 do begin
I2 := Ord(UpCase(S[I1 + 1])) - 64;
if ((I2 < 1) or (I2 > 26)) then I2 := 1;
SoundString[I1] := Table[I2];
end;
// Initialize for second pass
I1 := 1;
repeat
while (SoundString[I1] = '.') do
Delete(SoundString, I1, 1);
while ((SoundString[I1] = SoundString[I1 + 1]) and (SoundString[I1] <> '0')) do
Delete(SoundString, I1, 1);
Inc(I1);
until (SoundString[I1] = '0');
Result := Ch + Copy(SoundString, 1, 3);
end;
Using this algorithm, both "Robert" and "Rupert" return the same string "R163"
while "Rubin" and "Robin" yields "R150". "Ashcroft" and "Ashcraft" yields "A226".
However, "George" and "Jorge" yields "G620" and "J620" and "Dave" and "David"
yields "D100" and "D130".
This simple example shows a need for other comparison algorithms. Read more on
other comparison techniques I used in my search engines.
G.E. Ozz Nixon Jr.