FreePascal Information Logo Friend of FreePascal Compiler Title
Articles with Feedback, FPC News Library, PDF Collection, Mail Lists, Books, Newsgroups, IRC Open online discussion areas Research and Tutorials Tools, Compilers and Utilities Blurbs about us, advertising, etc.
Welcome to the FoFPC Research Notes: Indistinct Matching

Data Searching using Indistinct Matching

by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC



     Data Searching is a very complex art in itself. Ignoring the long history of Internet "search engines" growing from Archie in 1992 to today's Google engine. Original techniques to provide users with "full text" search required a solution designed for common mistakes, one of which is called Indistinct Matching.

     In the mathematical discipline of graph theory a matching or edge-independent set in a graph is a set of edges without common vertices. When applied to words, this theory is capable of calculating words which sound a like but are spelled differently. Unfortunately, it is on a rare occasion you will get the results with words that you would expect when running this theory against graphs.

Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
Type
     TRetCount = packed record
        lngSubRows   : Word;
        lngCountLike : Word;
     end;

function IndistinctMatching(MaxMatching     : Integer;
                            strInputMatching: PChar;
                            strInputStandart: PChar): Integer;
Var
    gret,tret:TRetCount;
    lngCurLen:Integer;

begin
    If (MaxMatching = 0) Or (strInputMatching^ = #0) Or
       (strInputStandart^ = #0) Then begin
        IndistinctMatching:= 0;
        exit;
    end;

    gret.lngCountLike:= 0;
    gret.lngSubRows  := 0;
    For lngCurLen:= 1 To MaxMatching do begin
        tret:= Matching(strInputMatching, strInputStandart, lngCurLen);
        gret.lngCountLike := gret.lngCountLike + tret.lngCountLike;
        gret.lngSubRows   := gret.lngSubRows + tret.lngSubRows;
        tret:= Matching(strInputStandart, strInputMatching, lngCurLen);
        gret.lngCountLike := gret.lngCountLike + tret.lngCountLike;
        gret.lngSubRows   := gret.lngSubRows + tret.lngSubRows;
    end;

    If gret.lngSubRows = 0 Then begin
        IndistinctMatching:= 0;
        exit;
    end;
    IndistinctMatching:= Trunc((gret.lngCountLike / gret.lngSubRows) * 100);
end;

     Using this algorithm, both "Robert" and "Rupert" return "42%" the same, while "Rubin" and "Robin" yields "49%". "Ashcroft" and "Ashcraft" yields "57%". However, "George" and "Jorge" yields "60%", and "Dave" and "David" yields "49%" the same.

     In this sample, we see another way of veiwing the differences between words. In this technique, there is a higher penalty for words that sound a like but are not spelled a like. However, notice that George and Jorge scored the highest due to the fact that 4 of the 5 "sides" are identical, only the first "side/sides" were different. Yet, for a pattern that differs in the middle or even the end - the score drops. So, this theory is useful for pattern matches that probably start differently. Please continue reading the other algorithms I use in my search engines.

G.E. Ozz Nixon Jr.
 Links and Products we find useful



ButtonGenerator.com
Valid XHTML 1.0 Transitional Internet Map
Programmer's Heaven
grat-i-fi-ca-tion - noun
the state of being gratified; great satisfaction.


"Wow! We are so pleased to see Friends of FreePascal Compiler ... with such a cool look and feel!"

"We like the fact you wrote all of the server scripts using FPC!"

Ian Wright
Codemasters
Locations of visitors to this page world map hits counter
Copyright 2009 by 3F, LLC. All rights reserved. Worldwide.
Your request was processed by server #2 in 0.003054 secs.

sponsor
This sponsor helps us with our documentation