Data Searching using Indistinct Matching
by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Indistinct Matching.
In the mathematical discipline of graph theory a matching or edge-independent set
in a graph is a set of edges without common vertices. When applied to words, this
theory is capable of calculating words which sound a like but are spelled differently.
Unfortunately, it is on a rare occasion you will get the results with words that
you would expect when running this theory against graphs.
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
Type
TRetCount = packed record
lngSubRows : Word;
lngCountLike : Word;
end;
function IndistinctMatching(MaxMatching : Integer;
strInputMatching: PChar;
strInputStandart: PChar): Integer;
Var
gret,tret:TRetCount;
lngCurLen:Integer;
begin
If (MaxMatching = 0) Or (strInputMatching^ = #0) Or
(strInputStandart^ = #0) Then begin
IndistinctMatching:= 0;
exit;
end;
gret.lngCountLike:= 0;
gret.lngSubRows := 0;
For lngCurLen:= 1 To MaxMatching do begin
tret:= Matching(strInputMatching, strInputStandart, lngCurLen);
gret.lngCountLike := gret.lngCountLike + tret.lngCountLike;
gret.lngSubRows := gret.lngSubRows + tret.lngSubRows;
tret:= Matching(strInputStandart, strInputMatching, lngCurLen);
gret.lngCountLike := gret.lngCountLike + tret.lngCountLike;
gret.lngSubRows := gret.lngSubRows + tret.lngSubRows;
end;
If gret.lngSubRows = 0 Then begin
IndistinctMatching:= 0;
exit;
end;
IndistinctMatching:= Trunc((gret.lngCountLike / gret.lngSubRows) * 100);
end;
Using this algorithm, both "Robert" and "Rupert" return "42%" the same,
while "Rubin" and "Robin" yields "49%". "Ashcroft" and "Ashcraft" yields "57%".
However, "George" and "Jorge" yields "60%", and "Dave" and "David"
yields "49%" the same.
In this sample, we see another way of veiwing the differences between words.
In this technique, there is a higher penalty for words that sound a like but
are not spelled a like. However, notice that George and Jorge scored the
highest due to the fact that 4 of the 5 "sides" are identical, only the first
"side/sides" were different. Yet, for a pattern that differs in the middle
or even the end - the score drops. So, this theory is useful for pattern
matches that probably start differently.
Please continue reading the other algorithms I use in my search engines.
G.E. Ozz Nixon Jr.