If you're anything like me, you may have years (or, more like me, decades) of data that you've kept backed up from machine to machine, or from old drive to new drive to larger drive, and on and on.
This year I was collecting several drives of old backups onto a new drive, and I realized I easily had several hundred gigs of duplicate data throughout the drives in various folders and places.
My initial thought was to manually go through and remove the copies, but I quickly realized that this was a task of pain and suffering to come. What did I do? Head over to Google and find five or six tools that claim they can do this. Narrow it down by removing the Windows-only tools, and keep going with the remaining three. Next, dig in a bit and read about each of them, and, yes, finally decide to give rdFind a try.
Let me just say, it worked incredibly well. It helped me identify 355 GB of duplicate files, and after a dry run to make sure everything was correctly identified, it helped me delete the duplicates, keeping only one copy of each.
How Does it Work?
First, rdFind compares files by their initial bytes, then their final bytes, and then by a checksum of each file (you can choose the checksum algorithm). It uses those stages of comparison to accurately determine which files are actual duplicates, and not just files with similar names or sizes.
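To make the staged comparison concrete, here's a simplified Python sketch of the same idea: narrow candidates by size, then by the first bytes, then the last bytes, and only checksum the survivors. This is an illustration of the technique, not rdFind's actual code.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(paths, probe=64):
    """Group files with identical content, narrowing candidates in
    stages the way rdFind's approach works (simplified illustration)."""

    def read_chunk(path, offset, size):
        # Read `size` bytes; a negative offset means "from the end".
        with open(path, "rb") as f:
            if offset < 0:
                f.seek(max(os.path.getsize(path) + offset, 0))
            else:
                f.seek(offset)
            return f.read(size)

    # Stage 1: only files with the same size can be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        # Stage 2: compare the first few bytes.
        by_head = defaultdict(list)
        for p in candidates:
            by_head[read_chunk(p, 0, probe)].append(p)
        for heads in by_head.values():
            if len(heads) < 2:
                continue
            # Stage 3: compare the last few bytes.
            by_tail = defaultdict(list)
            for p in heads:
                by_tail[read_chunk(p, -probe, probe)].append(p)
            for tails in by_tail.values():
                if len(tails) < 2:
                    continue
                # Stage 4: full checksum, only for the survivors.
                by_sum = defaultdict(list)
                for p in tails:
                    h = hashlib.sha1()
                    with open(p, "rb") as f:
                        for block in iter(lambda: f.read(1 << 16), b""):
                            h.update(block)
                    by_sum[h.hexdigest()].append(p)
                groups.extend(g for g in by_sum.values() if len(g) > 1)
    return groups
```

The point of the staging is that cheap checks (file size, a 64-byte read) eliminate most candidates before any expensive full-file checksums are computed.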
For Ubuntu / Debian based systems:
sudo apt install rdfind
For Fedora / CentOS / Red Hat with dnf do:
sudo dnf install rdfind
You can run it on Windows using Cygwin, or on macOS with a Homebrew install as well.
The Basics of Running rdFind
Anytime you want more information on how to use a command line tool, you can use the man command followed by the tool name. In this case we can use
man rdfind
to see a ton of usage options and information.
Essentially, we want to search inside a directory or drive and all of its subdirectories (recursively) and let rdFind look for any and all duplicate files. But before we delete anything we want to do a dry run, just so we can see what is identified as a duplicate and make sure everything looks correct. This helps us guard against undesired data loss.
rdfind -dryrun true <path to the folder you want to scan>
In my case, I ran this for the video on a folder I created in my home directory with some intentionally duplicated files to show how it works. But at home, I ran it on a 4 TB USB drive full of data.
Depending on the size of the drive or directory you want to scan, and the amount of data in it, the scan can take anywhere from milliseconds to hours.
On my 4TB drive, the dry run took about 45 minutes to run.
When it's done, the dry run produces a file called results.txt in the directory you ran the scan from.
You can look at the contents of results.txt with any text editor you want, and check out what has been identified as a duplicated file.
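If you'd rather pull just the removable paths out of results.txt than read it by eye, a few lines of Python will do it. This sketch assumes the layout I saw: comment lines start with `#`, and each data line has a duptype field first and the file path last; check the header comment in your own results.txt to confirm, since the exact columns may vary by version.

```python
def duplicate_paths(results_file):
    """Return the paths flagged as removable duplicates in a
    results.txt file.

    Assumes data lines of the form:
      duptype id depth size device inode priority name
    (verify against the '#' header in your own results.txt).
    """
    dupes = []
    with open(results_file) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            # maxsplit=7 keeps a path containing spaces in one piece
            fields = line.rstrip("\n").split(None, 7)
            duptype, path = fields[0], fields[-1]
            # The first occurrence in each group is the copy kept.
            if duptype != "DUPTYPE_FIRST_OCCURRENCE":
                dupes.append(path)
    return dupes
```

Printing `duplicate_paths("results.txt")` gives you a plain list of what would be deleted, which is handy for spot-checking before the real run.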
When you're ready to delete the duplicates, you can run the same command with a different flag.
rdfind -deleteduplicates true <path to the folder you want to scan>
This time the scan will re-identify the duplicates, and delete them from the drive or directory.
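Conceptually, the delete pass keeps one file from each duplicate group and removes the rest. Here's a simplified Python equivalent of that behavior (again, just the idea, not rdFind's real internals, which also rank which copy to keep):

```python
import hashlib
import os
from collections import defaultdict


def delete_duplicates(paths):
    """Keep one copy per group of identical files and delete the rest.
    Simplified illustration of what a delete pass does."""
    by_sum = defaultdict(list)
    for p in sorted(paths):  # sort so the kept copy is deterministic
        h = hashlib.sha1()
        with open(p, "rb") as f:
            for block in iter(lambda: f.read(1 << 16), b""):
                h.update(block)
        by_sum[h.hexdigest()].append(p)

    removed = []
    for group in by_sum.values():
        for extra in group[1:]:  # group[0] is the copy we keep
            os.remove(extra)
            removed.append(extra)
    return removed
```

Note that the deletion is immediate and permanent, which is exactly why the dry run comes first.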
Do not run this scan until you have triple verified the path to the drive or directory you intend to scan. Once done, it can't be undone.
I now have a drive with all the backup data I want to keep, but with 355 GB more free space than I had when I started.
It's a great tool, very simple to use, and yet very powerful.
Use it wisely.