Drive Crawling Techniques
by: G.E. Ozz Nixon Jr.
Published: August 2009
©opyright 2009 by Friends of FPC
While developing a security audit solution for cross-platform
tracking of file changes, I ran into a situation where I thought the machine I was on was
capable of doing a full listing of every file on the drive much faster. After developing the
second technique and not seeing any improvement, I tried "ls -lR" (Dir /s for windows) and
the result was a little slower than either technique documented in this research document.
The techniques described here are not new, not inventive, not intuitive - however, over the
past 10 years developers have became more and more dependent upon "components". There is
actually a negative to this dependency - the developers do not know how to do simple low-level
techniques. To crawl a drive (as I call it) in pascal you use the FindFirst, FindNext and
FindClose routines. These routines use a variable structure called TSearchRec, which contains
the filename, size and DOS based timestamp along with different attributes. These routines
were introduced to the pascal language back in the DOS era, so they do not contain all of
the extended attributes of the last 14+ years since the "leap" to Windows.
FindFirst is a function which returns an Integer, where zero is success. So to crawl a drive
under Linux, Windows or Mac, you do a FindFirst(Path+Pattern, faAnyFile, SearchRecord). The
path and pattern are where to start and what to look for, e.g. C:\*.* (Windows and DOS) and
/* (Linux and Mac). The faAnyFile is a combined attribute to look for System, Hidden, ReadOnly.
Free Pascal (2.0 and later) introduced the ability to also denote Symbolic Links. Recent
Windows operating systems introduced linking like Unix's Symbolic Link. A Symbolic Link is
a file system entry which appears like a normal file or folder, however, it is a "pointer"
to another file or folder.
The reason I point out the Symbolic Link in a little more detail, is, in the Unix world and
especially on the Mac - your symbolic links can often be circular. For example, on the Mac
you have a folder "/Volumes", which will contain a Symbolic link to the current drive and
optionally other drives. In my case "Applications -> /" (The root of the drive). So crawling
down would see:
/Volumes
/Volumes/Applications
and my code would then try to go into that folder, which is "/" (the root of the drive).
Which, as you may have guessed, contains the "Volumes" folder, which contains the Applications
folder which of course, points me right back to the root -- producing a circular reference.
ScanDisk Design 1
dure scanDisk1(const path:AnsiString);
const
{$IFDEF UNIX}
pattern = '*';
{$ELSE}
pattern = '*.*';
{$ENDIF}
var
SRec:TSearchRec;
Err:Integer;
begin
Err:=FindFirst(path+pattern,faAnyFile {$IFDEF UNIX}+ faSymLink{$ENDIF},SRec);
while Err=0 do begin
if (SRec.Attr and faDirectory)=faDirectory then begin
{$IFDEF UNIX}
if SRec.mode and 61440 = 40960 then begin
Err:=FindNext(SRec);
Continue;
end;
{$ENDIF}
if (SRec.Name='.') then begin
end
else if (SRec.Name='..') then begin
end
else // recursive:
{$IFDEF UNIX}
scanDisk1(path+SRec.Name+'/');
{$ELSE}
scanDisk1(path+SRec.Name+'\');
{$ENDIF}
end
else begin
System.Writeln(Path+SRec.Name);
end;
Err:=FindNext(SRec);
End;
FindClose(SRec);
end;
Download ScanDisk Design 1
The above routine was the fastest of the two, however, it implements a recursive stack call to itself.
In the DOS days, this was a technique that you worried about, as too many recursions produces a
stack overflow error. Not something people worry about or probably even know about. What happens
is every program has a static memory region, known as your stack, which is a predefined size. If
a routine calls itself, or a routine which calls back to the original routine, this requires the
"local" variables to be PUSH(ed) onto the "Stack". If you do it enough time, you will get a stack
overflow error.
For example, if you want to see this happen, I wrote a quick and simple piece of code you
can download called stakfail.dpr (so it can be compiled in Windows Delphi, or FPC across all
platforms). Compile it, run it, and viola - stack overflow! (Segmentation fault in Linux/Mac).
Download StakFail.dpr
ScanDisk Design 2
dure scanDisk2(const rootpath:AnsiString);
const
{$IFDEF UNIX}
pattern = '*';
{$ELSE}
pattern = '*.*';
{$ENDIF}
var
PathTreePos,PathTreeCtr,PathTreeTop:LongWord;
PathTree:Array of String;
procedure scanner(const path:AnsiString);
var
SRec:TSearchRec;
Err:Integer;
begin
Err:=FindFirst(path+pattern,faAnyFile {$IFDEF UNIX}+ faSymLink{$ENDIF},SRec);
while Err=0 do begin
if (SRec.Attr and faDirectory)=faDirectory then begin
{$IFDEF UNIX}
if SRec.mode and 61440 = 40960 then begin
Err:=FindNext(SRec);
Continue;
end;
{$ENDIF}
if (SRec.Name='.') then begin
end
else if (SRec.Name='..') then begin
end
else begin
inc(pathtreectr);
if pathtreectr=pathtreetop-10 then begin
inc(pathtreetop,1000);
setlength(pathtree,pathtreetop);
end;
{$IFDEF UNIX}
pathtree[pathtreectr-1]:=path+SRec.Name+'/';
{$ELSE}
pathtree[pathtreectr-1]:=path+SRec.Name+'\';
{$ENDIF}
end;
end
else begin
System.Writeln(Path+SRec.Name);
end;
Err:=FindNext(SRec);
End;
FindClose(SRec);
end;
begin
pathtreepos:=0;
pathtreetop:=1000;
pathtreectr:=1;
setlength(pathtree,pathtreetop);
pathtree[0]:=rootpath;
while pathtreepos < pathtreectr do begin
scanner(pathtree[pathtreepos]);
inc(pathtreepos);
end;
setlength(pathtree,0);
end;
Download ScanDisk Design 2
The above routine is almost as fast as the first routine, however, it requires a little more CPU
time and uses more memory, especially on a busy development machine with over 2 million files on it.
Note
{$IFDEF UNIX}
if SRec.mode and 61440 = 40960 then begin
Err:=FindNext(SRec);
Continue;
end;
{$ENDIF}
The above routine is KEY for avoiding symbolic links, in my implementation it
falls under attr is faDirectory - thus avoiding symbolic link directories. This
is not documented anywhere on the FPC site, in the code, etc. It took me about
two hours of hacking around through the kernel, and "ls" source code and converting
to DECIMAL. So if you need to know how to detect a faSymLink in Free Pascal
the above routine is your trick. If you are familiar with standard TSearchRec
you know ".mode" does not normal exist. This is a Free Pascal extension to the
old DOS TSearchRec structure.
G.E. Ozz Nixon Jr.