
Similar file finder

Newsgroups: comp.lang.perl.misc,comp.unix.programmer

Hi,

I'm planning to write a "Similar file finder", which will walk the given directories to find all similar files within them.

First, I want to know whether anybody has already written a similar tool.
After searching the news archive intensively, I could only locate one such request/response, in the newsgroup comp.graphics.apps.paint-shop-pro (what a strange place to find this kind of program :->)
http://groups.google.com/groups?hl=en&threadm=LhAh4.11633%24NU6.569262%40tw12.nn.bcandid.com&rnum=56&prev=/groups%3Fq%3Dfile%2Bsimilar%2Bcompare%2Bscript%2BOR%2Bprogram%26num%3D50%26hl%3Den%26start%3D50%26sa%3DN
,----- [quotes from above thread] ---
| Now you can down load any number of similar files and run the program to
| compare the incoming files against the list of "got that one" Even if
| someone renamed the file the program will recognize that and rename it to
| the correct name...
|
| If you want to get an idea of how powerful this tool is just substitute the
| word "font" in the above paragraph with any file extension you like (.jpg
| .gif .bmp .mp3 .mpg .avi etc.)
|
| How it works inside is kind of technical( I don't understand all of
| it)... so far the program has worked on every file type I have tried
| it on.
`-----
I did not include the DOS program's name here because what seems like such an amazing tool to a PSP user is just a program that compares CRC checksums. I'm sure most of us here could hack one up in just minutes.
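To make the point concrete, here is a rough sketch of such a checksum comparer. It is not the DOS program from the thread (I don't know how that one works inside); the function names crc32_of_file and find_identical_files are made up for this illustration, and Python is used only for brevity. Files are grouped by (size, CRC32), so a matching group is a strong hint of identical content rather than proof; a careful tool would follow up with a byte-by-byte comparison.

```python
import os
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC32 of a file, read in chunks to keep memory use low."""
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def find_identical_files(roots):
    """Walk the given directories and group files sharing size and CRC32.

    Returns a list of groups (each a sorted list of paths) that contain
    more than one file. File names play no part in the comparison, so
    renamed copies end up in the same group.
    """
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):
                    key = (os.path.getsize(path), crc32_of_file(path))
                    groups[key].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling find_identical_files(["/some/dir", "/another/dir"]) (the paths are placeholders) returns the groups of probable duplicates, whatever the files happen to be called.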
What I have in mind is much more powerful than a checksum comparison. It can find not only identical files but also similar files. OK, what are similar files? Files that have different names, timestamps and sizes (and possibly different content), and yet represent the same thing.
Is such a program really necessary? Why would files with different names, timestamps and sizes represent the same thing? Well, do words like "Napster" or "Gnutella" ring a bell? Different names for the same file are not rare at all. Different versions (.txt, .html, or .pdf) and different compression methods (.zip, .gz, .tar.gz, .bz2) make it even worse. On top of that, there are partial downloads floating around everywhere. Moreover, the sample rate makes a huge difference in MP3 files, even if they do not sound much different to human ears.
One poster said (in the above thread):
,-----
| I have over 200 cd that I have burned full of different files. Also three
| HD's. There is not a duplicate of any file anywhere on any of my storage
| mediums.
`-----
Well, my collection is much, much smaller than his or hers, but I'm sure that more than 10% of it consists of duplicated or similar files, and the real percentage is very likely higher.

Similar file finder

> It sounds like you're trying to reinvent what Napster calls the
> "fingerprinting technology". And it seems to me that that isn't exactly
> simple. No, I'm not an insider.
>
> So let's say you want to recognize files of the same audio track, or
> image files at a different resolution. So what can you do? First,
No, I'm not planning to be that fancy. I'm going to make the guess based only on the file name and file size.
I'd be happy enough if my program could pick out the following (among thousands of files) as similar-file candidates:
Andie Macdowell.jpg   8k
andy macddowel.gif    12k
As for the algorithm, I'm going to map each individual word in the file name to its soundex code, and use a term vector (I borrow this term and idea from the famous TF/IDF information retrieval algorithm for similarity calculation) to determine the similarity between files. In general, if there are n files, each having approximately m words, comparing every pair of files costs on the order of
O(n^2 * m)
File size is also taken into consideration.
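
A minimal sketch of the idea, under several assumptions of mine: plain American Soundex, raw term counts in the vector (no IDF weighting), cosine similarity between vectors, and file size handled as a simple smaller/larger ratio. The function names and both thresholds are hypothetical, picked only for illustration; how the size should really be weighed is still open.

```python
import itertools
import math
import os
import re
from collections import Counter

# Soundex digit for each consonant; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    word = re.sub(r"[^A-Za-z]", "", word).upper()
    if not word:
        return ""
    result = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]


def name_vector(filename):
    """Term vector of soundex codes for the words in a file name."""
    stem = os.path.splitext(filename)[0]  # ignore the extension
    return Counter(soundex(w) for w in re.findall(r"[A-Za-z]+", stem))


def cosine(a, b):
    """Cosine similarity between two term vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_pairs(files, name_threshold=0.8, size_threshold=0.5):
    """files is a list of (name, size_in_bytes); compares every pair."""
    vectors = [name_vector(name) for name, _size in files]
    pairs = []
    for i, j in itertools.combinations(range(len(files)), 2):
        score = cosine(vectors[i], vectors[j])
        small, large = sorted((files[i][1], files[j][1]))
        ratio = small / large if large else 1.0
        if score >= name_threshold and ratio >= size_threshold:
            pairs.append((files[i][0], files[j][0], score, ratio))
    return pairs
```

With this sketch, "Andie Macdowell.jpg" and "andy macddowel.gif" both reduce to the codes A530 and M234, so their name vectors coincide, and the 8k/12k size ratio of about 0.67 passes the assumed threshold. The loop over all pairs is where the n^2 factor comes from.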