FreePascal Information Logo Friend of FreePascal Compiler Title
Articles with Feedback, FPC News Library, PDF Collection, Mail Lists, Books, Newsgroups, IRC Open online discussion areas Research and Tutorials Tools, Compilers and Utilities Blurbs about us, advertising, etc.
Welcome to the FoFPC Research Notes: "Drive Crawling Techniques"

Drive Crawling Techniques

by: G.E. Ozz Nixon Jr.
Published: August 2009
©opyright 2009 by Friends of FPC



     While developing a security audit solution for cross-platform tracking of file changes, I ran into a situation where I thought the machine I was on was capable of doing a full listing of every file on the drive much faster. After developing the second technique and not seeing any improvement, I tried "ls -lR" (Dir /s for windows) and the result was a little slower than either technique documented in this research document.

     The techniques described here are not new, not inventive, not intuitive - however, over the past 10 years developers have became more and more dependent upon "components". There is actually a negative to this dependency - the developers do not know how to do simple low-level techniques. To crawl a drive (as I call it) in pascal you use the FindFirst, FindNext and FindClose routines. These routines use a variable structure called TSearchRec, which contains the filename, size and DOS based timestamp along with different attributes. These routines were introduced to the pascal language back in the DOS era, so they do not contain all of the extended attributes of the last 14+ years since the "leap" to Windows.

     FindFirst is a function which returns an Integer, where zero is success. So to crawl a drive under Linux, Windows or Mac, you do a FindFirst(Path+Pattern, faAnyFile, SearchRecord). The path and pattern are where to start and what to look for, e.g. C:\*.* (Windows and DOS) and /* (Linux and Mac). The faAnyFile is a combined attribute to look for System, Hidden, ReadOnly. Free Pascal (2.0 and later) introduced the ability to also denote Symbolic Links. Recent Windows operating systems introduced linking like Unix's Symbolic Link. A Symbolic Link is a file system entry which appears like a normal file or folder, however, it is a "pointer" to another file or folder.

     The reason I point out the Symbolic Link in a little more detail, is, in the Unix world and especially on the Mac - your symbolic links can often be circular. For example, on the Mac you have a folder "/Volumes", which will contain a Symbolic link to the current drive and optionally other drives. In my case "Applications -> /" (The root of the drive). So crawling down would see:
/Volumes
/Volumes/Applications
and my code would then try to go into that folder, which is "/" (the root of the drive). Which, as you may have guessed, contains the "Volumes" folder, which contains the Applications folder which of course, points me right back to the root -- producing a circular reference.

ScanDisk Design 1
dure scanDisk1(const path:AnsiString);
const
{$IFDEF UNIX}
   pattern = '*';
{$ELSE}
   pattern = '*.*';
{$ENDIF}

var
   SRec:TSearchRec;
   Err:Integer;

begin
   Err:=FindFirst(path+pattern,faAnyFile {$IFDEF UNIX}+ faSymLink{$ENDIF},SRec);
   while Err=0 do begin
      if (SRec.Attr and faDirectory)=faDirectory then begin
{$IFDEF UNIX}
         if SRec.mode and 61440 = 40960 then begin
            Err:=FindNext(SRec);
            Continue;
         end;
{$ENDIF}
         if (SRec.Name='.') then begin
         end
         else if (SRec.Name='..') then begin
         end
         else // recursive:
{$IFDEF UNIX}
             scanDisk1(path+SRec.Name+'/');
{$ELSE}
             scanDisk1(path+SRec.Name+'\');
{$ENDIF}
      end
      else begin
         System.Writeln(Path+SRec.Name); 
      end;
      Err:=FindNext(SRec);
   End;
   FindClose(SRec);
end;

    Download Source IconDownload ScanDisk Design 1

      The above routine was the fastest of the two, however, it implements a recursive stack call to itself. In the DOS days, this was a technique that you worried about, as too many recursions produces a stack overflow error. Not something people worry about or probably even know about. What happens is every program has a static memory region, known as your stack, which is a predefined size. If a routine calls itself, or a routine which calls back to the original routine, this requires the "local" variables to be PUSH(ed) onto the "Stack". If you do it enough time, you will get a stack overflow error.

     For example, if you want to see this happen, I wrote a quick and simple piece of code you can download called stakfail.dpr (so it can be compiled in Windows Delphi, or FPC across all platforms). Compile it, run it, and viola - stack overflow! (Segmentation fault in Linux/Mac).
    Download Source IconDownload StakFail.dpr

ScanDisk Design 2
dure scanDisk2(const rootpath:AnsiString);
const
{$IFDEF UNIX}
   pattern = '*';
{$ELSE}
   pattern = '*.*';
{$ENDIF}

var
   PathTreePos,PathTreeCtr,PathTreeTop:LongWord;
   PathTree:Array of String;

procedure scanner(const path:AnsiString);
var
   SRec:TSearchRec;
   Err:Integer;

begin
   Err:=FindFirst(path+pattern,faAnyFile {$IFDEF UNIX}+ faSymLink{$ENDIF},SRec);
   while Err=0 do begin
      if (SRec.Attr and faDirectory)=faDirectory then begin
{$IFDEF UNIX}
         if SRec.mode and 61440 = 40960 then begin
            Err:=FindNext(SRec);
            Continue;
         end;
{$ENDIF}
         if (SRec.Name='.') then begin
         end
         else if (SRec.Name='..') then begin
         end
         else begin
             inc(pathtreectr);
             if pathtreectr=pathtreetop-10 then begin
                inc(pathtreetop,1000);
                setlength(pathtree,pathtreetop);
             end;
{$IFDEF UNIX}
             pathtree[pathtreectr-1]:=path+SRec.Name+'/';
{$ELSE}
             pathtree[pathtreectr-1]:=path+SRec.Name+'\';
{$ENDIF}
         end;
      end
      else begin
         System.Writeln(Path+SRec.Name);
      end;
      Err:=FindNext(SRec);
   End;
   FindClose(SRec);
end;

begin
   pathtreepos:=0;
   pathtreetop:=1000;
   pathtreectr:=1;
   setlength(pathtree,pathtreetop);
   pathtree[0]:=rootpath;
   while pathtreepos < pathtreectr do begin
      scanner(pathtree[pathtreepos]);
      inc(pathtreepos);
   end;
   setlength(pathtree,0);
end;

    Download Source IconDownload ScanDisk Design 2

      The above routine is almost as fast as the first routine, however, it requires a little more CPU time and uses more memory, especially on a busy development machine with over 2 million files on it.

Note
{$IFDEF UNIX}
         if SRec.mode and 61440 = 40960 then begin
            Err:=FindNext(SRec);
            Continue;
         end;
{$ENDIF}


      The above routine is KEY for avoiding symbolic links, in my implementation it falls under attr is faDirectory - thus avoiding symbolic link directories. This is not documented anywhere on the FPC site, in the code, etc. It took me about two hours of hacking around through the kernel, and "ls" source code and converting to DECIMAL. So if you need to know how to detect a faSymLink in Free Pascal the above routine is your trick. If you are familiar with standard TSearchRec you know ".mode" does not normal exist. This is a Free Pascal extension to the old DOS TSearchRec structure.

G.E. Ozz Nixon Jr.
 Links and Products we find useful



ButtonGenerator.com
Valid XHTML 1.0 Transitional Internet Map
Programmer's Heaven
grat-i-fi-ca-tion - noun
the state of being gratified; great satisfaction.


"Wow! We are so pleased to see Friends of FreePascal Compiler ... with such a cool look and feel!"

"We like the fact you wrote all of the server scripts using FPC!"

Ian Wright
Codemasters
Locations of visitors to this page world map hits counter
Copyright 2009 by 3F, LLC. All rights reserved. Worldwide.
Your request was processed by server #2 in 0.006707 secs.

sponsor
Click to visit our paid sponsor