
@tfeldmann
Last active December 27, 2024 18:50
Fast duplicate file finder written in python
#!/usr/bin/env python
"""
Fast duplicate file finder.
Usage: duplicates.py <folder> [<folder>...]
Based on https://stackoverflow.com/a/36113168/300783
Modified for Python3 with some small code improvements.
"""
import os
import sys
import hashlib
from collections import defaultdict
def chunk_reader(fobj, chunk_size=1024):
    """ Generator that reads a file in chunks of bytes """
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def get_hash(filename, first_chunk_only=False, hash_algo=hashlib.sha1):
    hashobj = hash_algo()
    with open(filename, "rb") as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))
        else:
            for chunk in chunk_reader(f):
                hashobj.update(chunk)
    return hashobj.digest()

def check_for_duplicates(paths):
    files_by_size = defaultdict(list)
    files_by_small_hash = defaultdict(list)
    files_by_full_hash = dict()

    for path in paths:
        for dirpath, _, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                except OSError:
                    # not accessible (permissions, etc) - pass on
                    continue
                files_by_size[file_size].append(full_path)

    # For all files with the same file size, get their hash on the first 1024 bytes
    for file_size, files in files_by_size.items():
        if len(files) < 2:
            continue  # this file size is unique, no need to spend cpu cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
            except OSError:
                # the file access might've changed till the exec point got here
                continue
            files_by_small_hash[(file_size, small_hash)].append(filename)

    # For all files with the hash on the first 1024 bytes, get their hash on the full
    # file - collisions will be duplicates
    for files in files_by_small_hash.values():
        if len(files) < 2:
            # the hash of the first 1k bytes is unique -> skip this file
            continue

        for filename in files:
            try:
                full_hash = get_hash(filename, first_chunk_only=False)
            except OSError:
                # the file access might've changed till the exec point got here
                continue

            if full_hash in files_by_full_hash:
                duplicate = files_by_full_hash[full_hash]
                print("Duplicate found:\n - %s\n - %s\n" % (filename, duplicate))
            else:
                files_by_full_hash[full_hash] = filename

if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Usage: %s <folder> [<folder>...]" % sys.argv[0])
@romainjouin

All that would be nicely enhanced with some multi-cpu parallelism where possible

@tfeldmann
Author

I guess this is more IO than CPU bound, so I don't think it would benefit much from multi CPU... I would love to be proven wrong!

@romainjouin

I recently had a task I thought was IO bound: I had to read 400K files, but when I multi-threaded it on 8 threads, the performance definitely improved, by a factor I would guess of at least 2 or 3. As there may be millions of files here, I think it could be worthwhile to test.
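
Whether threads help depends on the storage device and the OS cache, so this is something to measure rather than assume. A minimal sketch of how a hashing pass could be spread over a thread pool for such a test; `hash_file` and `hash_many` are hypothetical helper names that are not part of the gist, and the 64 KiB chunk size and 8 workers are arbitrary assumptions:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=65536):
    """Return the SHA-1 digest of a file, or None if it cannot be read."""
    hashobj = hashlib.sha1()
    try:
        with open(path, "rb") as f:
            # iter() with a sentinel keeps reading until f.read() returns b""
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hashobj.update(chunk)
    except OSError:
        return None
    return hashobj.digest()


def hash_many(paths, max_workers=8):
    """Hash several files concurrently; returns {path: digest or None}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest
        return dict(zip(paths, pool.map(hash_file, paths)))
```

Timing `hash_many(files, max_workers=1)` against `max_workers=8` on the same (cold-cache) folder would show whether the gain is real on a given machine.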

@tminakov

tminakov commented Jun 15, 2020

The original SO poster here 👋.
I would argue on this comment:

> The script on stackoverflow has a bug which could lead to false positives.

It's not a bug, but sub-optimal execution - the small hashes are used for the input on the 2nd pass, not the final result; i.e. they don't pop up to the user, just increase the runtime.


Couple of comments on the gist:

  • there are bugs hidden in the exception handling - if an exception does occur, the value that was to be assigned in the try block is used outside of it; after all, this was a hacky script for an SO answer :)
  • though not deprecated (and probably never going to be), printf-style formatting with %s is no longer the usual Python 3 way.

@jessicachinafile

Thanks much for modifying this for Python3 and for posting it! Worked like a charm with no alterations needed.

@pavulon18

pavulon18 commented Aug 5, 2020

I have found an issue with this script. I ran this script on my computer. One file was marked as a duplicate for many many other files.

The common file was: Default.rdp with a size of 0KB.

It was matched with several log files (text files) from a game I used to play (Star Wars: The Old Republic). These log files varied in size.

I copied and pasted just a few lines of output from the script:
Duplicate found:
\Documents\Backup from old laptop\Star Wars - The Old Republic\CombatLogs\combat_2014-07-29_21_58_20_214517.txt
\Documents\Default.rdp

Duplicate found:
\Documents\Backup from old laptop\Star Wars - The Old Republic\CombatLogs\combat_2014-08-02_13_37_30_520940.txt
\Documents\Default.rdp

Duplicate found:
\Documents\Backup from old laptop\Star Wars - The Old Republic\CombatLogs\combat_2014-08-10_00_13_50_659725.txt
\Documents\Default.rdp

Duplicate found:
\Documents\Backup from old laptop\Star Wars - The Old Republic\CombatLogs\combat_2014-08-13_12_09_11_572142.txt
\Documents\Default.rdp

What other information do you need to help with this?

I just looked further down the list and this same "Default.rdp" file also matched with several other files. It gets curiouser and curiouser.

@tfeldmann
Author

Very interesting! Are you sure the logfiles are not empty? Can you post them somewhere so I can try to reproduce this?

@tfeldmann
Author

@tminakov

> It's not a bug, but sub-optimal execution - the small hashes are used for the input on the 2nd pass, not the final result; i.e. they don't pop up to the user, just increase the runtime.

You're right 👍 I removed the comment.

@pavulon18

@tfeldmann

I should have looked closer before I posted my previous comment. When I went back through and looked at the combat logs, the ones that were listed were in fact empty.

Thank you for pointing that out to me. :)

@alexandros-kyriakides

You should add a condition to skip special files, such as named pipes. Otherwise, get_hash() will hang and the script will not terminate.
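
A minimal sketch of such a condition, assuming a hypothetical helper named `is_regular_file` (not part of the gist). It relies on `os.stat`, which only inspects the file's metadata and so does not block on a named pipe the way opening and reading it would:

```python
import os
import stat


def is_regular_file(path):
    """True only for regular files (symlinks are followed by os.stat)."""
    try:
        return stat.S_ISREG(os.stat(path).st_mode)
    except OSError:
        # not accessible, or a broken symlink
        return False
```

In `check_for_duplicates`, one place this could go is right before `files_by_size[file_size].append(full_path)`, skipping the path with `continue` when the check fails.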

@ann1h1lan

Hey guys, sorry if this is a really stupid question but I'm quite new to python!

Do I just drop it in the folder where I want to compare files?
Also, will this just report duplicates or actually remove duplicates? I want the latter, how can I make that happen?

Any help is greatly appreciated:)

@tfeldmann
Author

@ann1h1lan
You might want to check out organize, which uses this code.
You can then define a rule like this to delete duplicates:

rules:
  - folders:
      - ~/Downloads
    subfolders: true
    filters:
      - duplicate
    actions:
      - trash

@Malam9

Malam9 commented Apr 17, 2021

Can this be modified to only show duplicate folders? What about duplicate folders whose parents aren't duplicates?

@tfeldmann
Author

> Can this be modified to only show duplicate folders? What about duplicate folders whose parents aren't duplicates?

Do you mean the snippet or the organize config?
Of course it's possible. You'd have to recurse depth first into your folder tree and assign the hashes to compare.
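
A rough sketch of that depth-first idea, not taken from the gist or from organize. It assumes two folders count as duplicates when their entries have the same names and the same contents, and all function names here are hypothetical. Note that under this definition all empty folders compare equal, and permission errors are not handled:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path):
    """SHA-1 digest of a file's full content."""
    hashobj = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hashobj.update(chunk)
    return hashobj.digest()


def hash_tree(root, folders_by_hash):
    """Depth first: hash the children, then combine them into the folder hash."""
    hashobj = hashlib.sha1()
    for name in sorted(os.listdir(root)):  # sorted, so listing order is irrelevant
        path = os.path.join(root, name)
        if os.path.isdir(path) and not os.path.islink(path):
            child = hash_tree(path, folders_by_hash)
            kind = b"d"
        elif os.path.isfile(path):
            child = file_digest(path)
            kind = b"f"
        else:
            continue  # skip special files, broken links and linked folders
        hashobj.update(kind + os.fsencode(name) + b"\0" + child)
    digest = hashobj.digest()
    folders_by_hash[digest].append(root)
    return digest


def find_duplicate_folders(root):
    """Return groups of folders below (and including) root with equal hashes."""
    folders_by_hash = defaultdict(list)
    hash_tree(root, folders_by_hash)
    return [group for group in folders_by_hash.values() if len(group) > 1]
```

Because every folder gets its own hash, two identical folders are reported even when their parent folders differ.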

@danielsabey

Maybe this is a stupid question, but what would be the easiest way to only keep one of the duplicate files?

@rmcavalcante

rmcavalcante commented Aug 1, 2021

Hello,

First of all many thanks for this code.
I have used it and it works, except for ".MOV" files.
Since I was interested in deleting duplicated files, I have added

if(Path(duplicate).exists()):
   os.remove(duplicate)
   print("Deleted duplicate found:\n - %s\n" % (duplicate))

right after

if full_hash in files_by_full_hash:
   duplicate = files_by_full_hash[full_hash]
   print("Duplicate found:\n - %s\n - %s\n" % (filename, duplicate))

As you see, I use Python's pathlib.Path to test if the file exists before deleting it, because sometimes the algorithm identifies the same file as a duplicate of several other files, as it did in my case.

In my case I had 'path01/F1.jpg', 'path02/F1 (1).jpg', 'path02/G1.jpg' and 'path03/G (1).jpg' all with the same content.

The algorithm then flagged 'F1.jpg' as a duplicate of 'F1 (1).jpg', 'G1.jpg' and 'G (1).jpg', pairing ('F1.jpg', 'F1 (1).jpg'), ('F1.jpg', 'G1.jpg') and ('F1.jpg','G (1).jpg').

After deleting 'F1.jpg' for the first pair, it skips the other deletes, since the same 'F1.jpg' file is flagged as the duplicate each time.

Because of that, the algorithm does remove duplicated files. However, since it detects duplication in pairs, I had to run it several times until there were no duplicated files left to delete.

I shared my experience just in case you would like to improve your code.

All the best to you.

Regards,

Roberto

@SushiWaUmai

@rmcavalcante

For me, deleting the file instead of the duplicate solved the problem

so instead of

if(Path(duplicate).exists()):
   os.remove(duplicate)
   print("Deleted duplicate found:\n - %s\n" % (duplicate))

I did

if(Path(filename).exists()):
    os.remove(filename)
    print("Deleted duplicate found:\n - %s\n" % (filename))
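
Why this works can be shown without touching the disk. The sketch below mirrors the `files_by_full_hash` logic of the script: the first file seen for a hash is kept as the reference, and every later file with the same hash is the one to delete. `files_to_delete` is a hypothetical helper name and the hash value is a placeholder; the file names are the ones from the comment above:

```python
def files_to_delete(files_with_hashes):
    """Given (filename, full_hash) pairs in scan order, keep the first file
    seen for each hash and return every later file with the same hash."""
    first_seen = {}
    to_delete = []
    for filename, full_hash in files_with_hashes:
        if full_hash in first_seen:
            to_delete.append(filename)  # later copy: this is `filename` in the script
        else:
            first_seen[full_hash] = filename  # reference copy: `duplicate` in the script
    return to_delete


scan = [
    ("path01/F1.jpg", "same-hash"),
    ("path02/F1 (1).jpg", "same-hash"),
    ("path02/G1.jpg", "same-hash"),
    ("path03/G (1).jpg", "same-hash"),
]
print(files_to_delete(scan))
# ['path02/F1 (1).jpg', 'path02/G1.jpg', 'path03/G (1).jpg']
```

Deleting `filename` therefore removes three files in a single run and leaves exactly one copy, whereas deleting `duplicate` removes the shared reference file once and then finds nothing more to delete.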

@ntjess

ntjess commented Nov 13, 2021

Great file! I wanted to group all duplicates together, not just pairs of them. I also wanted just one printout of all duplicates at the end of the script. If anyone else wants this, I created a fork.

E.g. if you have 4 files that are all duplicates of each other, it would print

Duplicate found:
- <path>\check_dups\myfile - Copy (2).rtf
- <path>\check_dups\myfile - Copy (3).rtf
- <path>\check_dups\myfile - Copy.rtf
- <path>\check_dups\myfile.rtf

In contrast, the original script would print:

Duplicate found:
 - <path>\check_dups\myfile - Copy (3).rtf
 - <path>\check_dups\myfile - Copy (2).rtf

Duplicate found:
 - <path>\check_dups\myfile - Copy.rtf
 - <path>\check_dups\myfile - Copy (2).rtf

Duplicate found:
 - <path>\check_dups\myfile.rtf
 - <path>\check_dups\myfile - Copy (2).rtf
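
One way such grouping can be done is sketched below. This is an illustration only and is not the code from the fork; `group_duplicates` and `print_groups` are hypothetical names. Instead of remembering one file per full hash, every file is appended to a list keyed by its hash, and the groups are printed once at the end:

```python
from collections import defaultdict


def group_duplicates(files_with_hashes):
    """Collect (filename, full_hash) pairs into lists of identical files."""
    groups = defaultdict(list)
    for filename, full_hash in files_with_hashes:
        groups[full_hash].append(filename)
    # only hashes shared by two or more files are duplicates
    return [sorted(names) for names in groups.values() if len(names) > 1]


def print_groups(groups):
    """Print each group once, in the same style as the original script."""
    for names in groups:
        print("Duplicate found:")
        for name in names:
            print(" - %s" % name)
        print()
```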

@rjsdotorg

rjsdotorg commented Nov 24, 2021

For our use, I quickly added a "-d" argument, for optional deletion.
The diff lines are:

  • 13
    from pathlib import Path
  • 35
    def check_for_duplicates(paths, b_delete=False):
  • 81
    if full_hash in files_by_full_hash:
        duplicate = files_by_full_hash[full_hash]
        print("Duplicate found:\n - %s\n - %s" % (filename, duplicate))
        if b_delete and Path(filename).exists():
            os.remove(filename)
            print("  1st file removed:\n   - %s\n" % (filename))
  • 88
    if __name__ == "__main__":
        if len(sys.argv) <= 1:
            print("Usage: %s [-d] <folder> [<folder>...]" % sys.argv[0])
        elif sys.argv[1] == '-d':
            if sys.argv[2:]:
                check_for_duplicates(sys.argv[2:], b_delete=True)
        elif sys.argv[1:]:
            check_for_duplicates(sys.argv[1:])
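
As an alternative to parsing `sys.argv` by hand, the same `-d` switch could be handled with the standard library's `argparse`. This is only a sketch: the long form `--delete` and the help texts are assumptions added here, not part of the diff above:

```python
import argparse


def parse_args(argv=None):
    """Parse `[-d] <folder> [<folder>...]`; argv=None means sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Fast duplicate file finder.")
    parser.add_argument("-d", "--delete", action="store_true",
                        help="delete the later copy of each duplicate")
    parser.add_argument("folders", nargs="+", metavar="folder",
                        help="folder to scan")
    return parser.parse_args(argv)
```

The main block would then shrink to `args = parse_args()` followed by `check_for_duplicates(args.folders, b_delete=args.delete)`, and the usage message is generated automatically when no folder is given.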

@sahil193101

Hi guys, I am new to Python, so I want to know where I should put the folder path to detect the duplicates.

@sahil193101

I need some help; it would be much appreciated.

@tfeldmann
Author

You run this file with python3 duplicates.py folder1 path/to/folder2 folder3 and so on.

@sahil193101

Can you explain more about how to do it?

@tfeldmann
Author

Install python 3, save this gist as duplicates.py and run the given command in a terminal of your choice.

@jcjveraa

jcjveraa commented Jun 3, 2022

thanks @tfeldmann!

In case anyone else stumbles here - in my fork I have enabled automated hardlinking of duplicates. This is not something you want to do lightly as it can mess up your system, but it is what I was looking for to help reduce the size of my backups using rsnapshot (which does not handle moving files very well).

For you Thomas maybe a small improvement to the 'base' gist is grouping all files together using the true_duplicates dict, rather than just printing them out one by one.

@abidkhan03

Hello, I understand your code; it is very good and you did great work, but I have a small problem. I want to ask the user to select the folder, like this:
path = askdirectory(title='Select a directory')
Where and how can I use that in the script?

@tfeldmann
Author

@abidkhan03 You can use something like Gooey to create a simple GUI for this script
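
Without an extra dependency, the `askdirectory` call from the question could also be wired in directly. A minimal sketch, assuming `askdirectory` is the one from the standard library's `tkinter.filedialog`; both function names below are hypothetical:

```python
def ask_with_tkinter():
    """Show a folder picker and return the chosen path ('' if cancelled)."""
    # imported lazily so the script still runs where tkinter is not installed
    import tkinter
    from tkinter.filedialog import askdirectory

    root = tkinter.Tk()
    root.withdraw()  # hide the empty main window behind the dialog
    path = askdirectory(title="Select a directory")
    root.destroy()
    return path


def folders_from_args_or_dialog(argv, ask=ask_with_tkinter):
    """Use folders given on the command line, else ask the user for one."""
    if argv:
        return list(argv)
    path = ask()
    return [path] if path else []  # an empty result means the dialog was cancelled
```

The main block of the script could then call `check_for_duplicates(folders_from_args_or_dialog(sys.argv[1:]))`, so the dialog only appears when no folder was passed on the command line.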

@datatalking

> thanks @tfeldmann!
>
> In case anyone else stumbles here - in my fork I have enabled automated hardlinking of duplicates. This is not something you want to do lightly as it can mess up your system, but it is what I was looking for to help reduce the size of my backups using rsnapshot (which does not handle moving files very well).

@jcjveraa random question about the functionality of what you describe as "hardlinking" a duplicate.

I've read through the code, and somewhere between lines 120 and 146 of well-written code I'm having trouble keeping it all in my head. You had me until you started mixing a hash with an inode or file_node; from there it goes fuzzy for me. (non CS major) =)

Can you tell me more?

  1. Are we making indexes that reference hard file paths?
  2. Are we making indexes of indexes for speeding up future searches?
  3. Are we creating a 'never move or delete hash' that breaks if modified?

@jcjveraa

jcjveraa commented Aug 26, 2023

Hi there @datatalking - maybe the confusion starts at what hard links are?

To avoid wrong impressions, by the way: the code I added wouldn't be acceptable by my standards if it wasn't a gist (i.e. a proof of concept / handy tool), so it is not "well written" on my part ;-) My professional code is a lot more readable.

To your question:

A harddisk/ssd in and of itself doesn't contain files, but raw data (bytes). The "filesystem", like ext4 or ntfs or fat32, that we put on top of this disk, helps to assign meaning to sets of bytes saying "a certain file exists, and this file is stored at this and this 'block (of bytes)'". If you search online you'll find that these inode things have a role to play there. Think of this as the "path" of a file, or a hard "link" to the file, like www.github.com is a link to this site.

In case we find there are two files that exist, i.e. we have two separate hard links (which is the same thing: a file "exists" because there is at least one hard link to it, else it is simply unreachable data on a hard disk), but we find out by computing the hash of the files that these two files are effectively identical (e.g. you just copied a file using copy-paste on your Windows system), my addition will "unlink" (delete) one of the files and replace it with a hard link to the other.

Now the copied file and the original both refer to the same set of bytes on the disk, rather than a real copy of the bytes. This means there is less disk space used. It also means that changing the original also changes the 'copy', as it is no longer a true copy but a reference to the same bytes on disk as the original. That can be not at all what you want, hence this is a little dangerous.

In my usecase though this is exactly what I need as it helps a lot in keeping the size of "read only" backups down. It is easy to "undo" the hard linking when restoring the backup if that is required.

Note: a hard link differs from a "soft link" (or symlink / windows shortcut) in that as long as either of the "files" (= at least one hard link) exists, the data will remain on the disk.
The data is "deleted" when the last hard link to it is 'unlinked'. That fact is desired in my use case - soft links could mean I lose data in the backups if I would delete some file, as in that case deleting the original (unlinking the hard link) results in a broken shortcut and the data would be gone. With multiple hardlinks such as my version of the script creates, the data will be preserved until it is truly no longer required. Google can tell you more on this :-)
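
The behaviour described above can be tried out in a throwaway folder. The sketch below is an illustration and not the code from the fork: `replace_with_hardlink` and the `.hardlink-tmp` suffix are hypothetical, and it assumes both files live on the same filesystem and that the filesystem supports hard links:

```python
import os
import tempfile


def replace_with_hardlink(original, duplicate):
    """Replace `duplicate` with a hard link to the data of `original`."""
    temp_name = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(original, temp_name)      # second name for the original's data
    os.replace(temp_name, duplicate)  # swap it in, unlinking the old copy


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "original.txt")
    copy = os.path.join(folder, "copy.txt")
    for path in (original, copy):
        with open(path, "w") as f:
            f.write("same content")

    replace_with_hardlink(original, copy)
    print(os.path.samefile(original, copy))  # do both names refer to one file?
    print(os.stat(original).st_nlink)        # number of hard links to the data

    os.remove(original)                      # unlink one of the two names
    with open(copy) as f:
        print(f.read())                      # the data is still reachable via the other
```

Creating the link under a temporary name first means the duplicate is never missing if the link step fails, which matters when the tool runs over backups.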

@jcjveraa

Note, when I say that hardlinking can be undone this is "... by an expert user and then still only to a certain extent". Nothing comes for free, and what you will lose/need to restore via other means (if required) is file metadata including file access permissions. If like in my case this is not an issue, then my method works, but your mileage may vary :-)
