Find and Delete All Duplicate Files

I was asked this question today but couldn't come up with a quick answer. The typical manual approach is to compare file sizes and then file contents, using a hash or checksum.

There are quite a number of duplicate file finder tools out there, but we will try a console tool called fdupes. Typical usage of this program:

1. Install the program.
$ sudo apt-get install fdupes

2. Create sample testing files.
$ cd /tmp
$ wget http://upload.wikimedia.org/wikipedia/en/a/a9/Example.jpg -O a.jpg
$ cp a.jpg b.jpeg
$ touch c.jpg d.jpeg

3. Show all duplicate files.
$ fdupes -r  .
./a.jpg                                 
./b.jpeg

./c.jpg
./d.jpeg

4. Show all duplicate files but omit the first file in each set.
$ fdupes -r -f .
./b.jpeg                                

./d.jpeg

5. Similar to step 4, but delete the duplicate files and keep one copy.
$ fdupes -r -f . | grep -v '^$' | xargs rm -v
removed `./b.jpeg'                      
removed `./d.jpeg'
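Note that the xargs pipeline above breaks on filenames containing spaces or newlines. As a safer sketch of the same keep-one-copy logic (not fdupes itself, just an illustration), a small Python function can hash every file and remove all but the first-seen copy of each identical file:

```python
import hashlib
import os

def delete_duplicates(directory):
    """Walk directory, hash every file's content, and remove all
    but the first-seen copy of each set of identical files.
    Returns the list of removed paths."""
    seen = {}      # content digest -> first path seen with that content
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)       # duplicate content: delete this copy
                removed.append(path)
            else:
                seen[digest] = path   # first copy: keep it
    return removed
```

Unlike the xargs pipeline, this handles arbitrary filenames, since no shell word-splitting is involved.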

On a similar note, there is an interesting read on the optimized approach taken by Dupseek, an app that finds duplicate files. Its main strategy is to group files by size, compare the files within each group, and ignore any group containing only one file, since a file with a unique size cannot have a duplicate.
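The size-grouping strategy can be sketched in a few lines of Python (this is my own rough illustration of the idea, not Dupseek's actual Perl implementation; it uses a single full-content hash per candidate file rather than Dupseek's incremental comparison):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(directory):
    """Return lists of paths with identical content, hashing only
    files whose size is shared with at least one other file."""
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:   # unique size: skip, no hashing needed
            continue
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

The payoff is that most files have a unique size, so most files are never read at all; only the size-collision groups pay the cost of hashing.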

Unfortunately, I've had a hard time understanding the Perl code. The closest and simplest implementation I could find is the tiny find-duplicates Python script.
