File::Searcher::Similars - Similar files locator
use File::Searcher::Similars;
File::Searcher::Similars->init(0, \@ARGV); similarity_check_name();
Files that are similar in both size and name are picked as suspected duplicates.
An extremely fast file similarity checker. It uses an advanced soundex vector algorithm to determine the similarity between files. In general, for n files that each have approximately m words, the computational complexity is merely
O(n^2 * m)
which is hundreds of times faster than existing file fingerprinting techniques.
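To make the O(n^2 * m) cost concrete, here is a minimal sketch of a pairwise name comparison over soundex codes. It is not the module's actual implementation: the helper names (soundex_code, name_vector, similarity), the simplified soundex variant, the shared-codes score, and the 0.5 threshold are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Simplified Soundex (assumption: the h/w adjacency rule of the full
# algorithm is ignored). Returns the first letter plus three digits.
sub soundex_code {
    my ($word) = @_;
    (my $w = uc $word) =~ s/[^A-Z]//g;
    return '' unless length $w;
    my $first = substr $w, 0, 1;
    (my $digits = $w) =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    $digits =~ tr/0-6//s;         # squeeze runs of the same code
    substr($digits, 0, 1, '');    # drop the code of the first letter
    $digits =~ tr/0//d;           # drop the placeholder for vowel-like letters
    return substr($first . $digits . '000', 0, 4);
}

# Hypothetical helper: split a file name into words (on CamelCase and
# non-letters) and return the set of their soundex codes as a hash ref.
sub name_vector {
    my ($file) = @_;
    (my $base = $file) =~ s/\.[^.]+$//;    # strip the extension
    my %codes;
    for my $word ($base =~ /[A-Z]?[a-z]+|[A-Z]+(?![a-z])/g) {
        my $code = soundex_code($word);
        $codes{$code} = 1 if length $code;
    }
    return \%codes;
}

# Hypothetical score: shared codes divided by the size of the larger set.
# Each comparison costs O(m) hash lookups.
sub similarity {
    my ($x, $y) = @_;
    my $shared = grep { exists $y->{$_} } keys %$x;
    my ($nx, $ny) = (scalar keys %$x, scalar keys %$y);
    my $larger = $nx > $ny ? $nx : $ny;
    return $larger ? $shared / $larger : 0;
}

my @files = qw(MacHelloWorld.gif MacintoshHelloWorrld.bmp
               BinderyHelloWorld.jpg PopupTest.java);
my %vec = map { ($_ => name_vector($_)) } @files;

# n*(n-1)/2 pairs, each compared in O(m): O(n^2 * m) overall.
for my $i (0 .. $#files - 1) {
    for my $j ($i + 1 .. $#files) {
        my $score = similarity($vec{ $files[$i] }, $vec{ $files[$j] });
        printf "%.2f  %s  %s\n", $score, $files[$i], $files[$j]
            if $score >= 0.5;    # threshold is an arbitrary assumption
    }
}
```

Because the codes are phonetic, a misspelling such as 'Worrld' maps to the same code as 'World', so such names still count as similar. Only file names are read, never file contents.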
The following is excerpted from the self-test and should be self-explanatory:
== In testing 2, you should see the following test result. Please check that it matches.
  - - >8 - -
  ## =========
           3 'PopupTest.java'           'test/'
           3 'CardLayoutTest.java'      'test/'
           4 'PopupButtonFrame.class'   'test/'

  ## =========
           4 'BinderyHelloWorld.jpg'    'test/'
           5 'SmallHello.gif'           'test/'
           5 'MacHelloWorld.gif'        'test/'
           6 'MacintoshHelloWorrld.bmp' 'test/'
  - - >8 - -
== In testing 3, you should see the following test result. Please check that it matches.
Note that this time the one big group from the previous test has been split into 2 smaller groups, each holding similar-sized files, because 'SmallHello.gif' has grown in size.
  - - >8 - -
  ## =========
           3 'PopupTest.java'           'test/'
           3 'CardLayoutTest.java'      'test/'
           4 'PopupButtonFrame.class'   'test/'

  ## =========
           6 'MacintoshHelloWorrld.bmp' 'test/'
           8 'SmallHello.gif'           'test/'

  ## =========
           4 'BinderyHelloWorld.jpg'    'test/'
           5 'MacHelloWorld.gif'        'test/'
  - - >8 - -
The File::Searcher::Similars package comes with a fully functional demo script fileSimilars.pl. Please refer to its help file for further explanations.
This package is highly customizable. Refer to the %config hash variable and/or the 3 arrwash_ functions for customization hints.
Author:  SUN, Tong <suntong at users sourceforge net>
HomeURL: http://xpt.sourceforge.net/
File::Compare(3), perl(1)
and the following scripts.
## File::Find::Duplicates - Find duplicate files
http://belfast.pm.org/Modules/Duplicates.html
use File::Find::Duplicates;
my %dupes = find_duplicate_files('/basedir1', '/basedir2');
When passed a base directory (or list of such directories) it returns a hash, keyed on filesize, of lists of the identical files of that size.
## ch::claudio::finddups - Find duplicate files in given directory
http://www.claudio.ch/Perl/finddups.html
ch::claudio::finddups is a script as well as a package. When called as a script, it searches the directory and its subdirectories for files with (possibly) identical content.
To find identical files quickly, the program just remembers the Digest::SHA1 hash of each file and reports two files as equal if their hashes match. It outputs lines that can be fed to a Bourne shell to compare the two files and remove one of them if the comparison shows that they are indeed identical.
It can also be used as a package, which gives access to its variables, routines and methods.
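The hash-then-compare approach can be sketched as follows. This is not the finddups code. It is a minimal illustration that assumes the core Digest::SHA module (in SHA-1 mode) in place of Digest::SHA1, and uses File::Compare for the confirming comparison instead of emitting shell commands. The function name group_by_sha1 is hypothetical.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Compare qw(compare);

# Hypothetical helper: group the given paths by the SHA-1 digest of their
# content and return only the groups that contain more than one file.
sub group_by_sha1 {
    my @paths = @_;
    my %by_digest;
    for my $path (@paths) {
        next unless -f $path;                  # skip anything but plain files
        my $sha = Digest::SHA->new(1);         # 1 selects SHA-1
        $sha->addfile($path, 'b');             # 'b' reads in binary mode
        push @{ $by_digest{ $sha->hexdigest } }, $path;
    }
    return map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
}

# Report candidate groups for the files named on the command line, and
# confirm each candidate against the first file of its group byte by byte.
for my $group (group_by_sha1(@ARGV)) {
    my ($first, @rest) = @$group;
    for my $other (@rest) {
        my $verdict = compare($first, $other) == 0 ? 'identical' : 'differs';
        print "$first  $other  $verdict\n";
    }
}
```

Hashing each file once keeps the work linear in the number of files. The explicit comparison afterwards guards against acting on a digest collision before anything is removed.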
## dupper.pl - finds duplicate files, optionally removes them
http://sial.org/code/perl/scripts/dupper.pl.html
Script to find (and optionally remove) duplicate files in one or more directories. Duplicates are spotted through the use of MD5 checksums.
Copyright (c) 2001-2003 Tong SUN. All rights reserved.