Find all duplicate files by MD5 hash


I’m trying to find all duplicate files (based on MD5 hash), ordered by file size. So far I have this:

find . -type f -print0 | xargs -0 -I "{}" sh -c 'md5sum "{}" |  cut -f1 -d " " | tr "\n" " "; du -h "{}"' | sort -h -k2 -r | uniq -w32 --all-repeated=separate

The output of this is:

1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.sh
1832348bb0c3b0b8a637a3eaf13d9f22 4.0K   ./picture2.s

d41d8cd98f00b204e9800998ecf8427e 0      ./test(1).log

Is this the most efficient way?

How to solve:

There are several ways to approach this. Method 1 is the recommended one, since it is a direct improvement to the pipeline above; the other two cover constrained environments and file-system-level deduplication.

Method 1

From “man xargs”: -I implies -L 1
So md5sum is run once per file, which is not the most efficient. It would be more efficient to pass as many filenames as possible to each md5sum invocation, which would be:

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Then you won’t have the file size, of course. If you really need the file size, write a shell script that runs md5sum and du -h and merges the lines with join.
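Below is a minimal sketch of such a script, not taken from the original answer. It assumes GNU coreutils (md5sum, du, sort, join, uniq) and GNU sed, and that file names contain no tabs or newlines; the temp-file handling and variable names are only illustrative.

#!/bin/sh
# Sketch: hash and size the files in two batched passes, then merge with join.
export LC_ALL=C                      # consistent sort order for join
TAB=$(printf '\t')
hashes=$(mktemp); sizes=$(mktemp)
trap 'rm -f "$hashes" "$sizes"' EXIT

# md5sum prints "hash  path"; rewrite as "path<TAB>hash", sorted by path for join.
find . -type f -exec md5sum {} + \
  | sed 's/^\([0-9a-f]\{32\}\)  \(.*\)$/\2\t\1/' \
  | sort -t "$TAB" -k1,1 > "$hashes"

# du -h prints "size<TAB>path"; swap to "path<TAB>size", sorted by path as well.
find . -type f -exec du -h {} + \
  | awk -F "$TAB" '{ print $2 "\t" $1 }' \
  | sort -t "$TAB" -k1,1 > "$sizes"

# Merge on the path, print "hash size path", sort by size, group identical hashes.
join -t "$TAB" "$hashes" "$sizes" \
  | awk -F "$TAB" '{ print $2, $3, $1 }' \
  | sort -h -k2,2 -r \
  | uniq -w32 --all-repeated=separate

The final sort -h -k2,2 -r | uniq -w32 stage mirrors the original pipeline: sort by the human-readable size field, then group lines whose first 32 characters (the MD5) repeat.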

Method 2

Sometimes we only have a reduced set of Linux commands to work with, such as BusyBox on a NAS or other embedded Linux (IoT) devices. In those cases we can’t use options like -print0 and we run into trouble with file names. So we may prefer instead:

find . -type f | while IFS= read -r file; do md5sum "$file"; done > /destination/file

Then, our /destination/file is ready for any kind of process like sort and uniq as usual.
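For example, if the box’s uniq supports -w and --all-repeated (or after copying the list to a full GNU system), the duplicates can be grouped the same way as before:

# Group lines whose first 32 characters (the MD5 hash) repeat.
sort /destination/file | uniq -w32 --all-repeated=separate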

Method 3

Use either btrfs + duperemove or ZFS with online dedup. These work at the file-system level: they match even identical parts of files and use the file system’s CoW to keep only one copy of each part while leaving the files in place. When you later modify one of the shared parts in one of the files, the change is written separately. That way things like /media and /backup/media-2017-01-01 can consume only the size of each unique piece of information across both trees.
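For illustration only (not from the original answer): a possible duperemove invocation on btrfs and the ZFS property, using the example paths above and a hypothetical dataset name; check duperemove(8) and zfs(8) for the exact options on your system.

# btrfs: scan both trees and deduplicate identical extents in place (CoW reflinks).
duperemove -dr --hashfile=/var/tmp/media.hash /media /backup/media-2017-01-01

# ZFS: enable online deduplication for a dataset (applies to newly written data).
zfs set dedup=on tank/media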

Note: Method 1 is the tested and recommended approach.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
