Python Office Automation: Remove duplicate images from computers in batches

Author: Artificial intelligence learns from people

Image files tend to pile up over time, and many of them are duplicates. These duplicate images not only take up valuable storage space, but also make file management and backup cumbersome. To solve this problem, we can write a Python script that automatically detects and removes duplicate image files.


Before we start writing code, we need a working Python environment and two libraries for comparing image similarity: imagehash and Pillow (a fork of PIL). Pillow opens the image files, and imagehash generates a hash for each image so that images can be compared by how close their hashes are.

You can use pip to install both libraries:

pip install imagehash pillow           

Writing the code

Here's a simple Python script to find and remove duplicate image files:

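import os
from itertools import combinations

import imagehash
from PIL import Image

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.bmp')

def image_hash(file_path):
    # Open the image and compute its average hash (a perceptual hash)
    with Image.open(file_path) as img:
        return imagehash.average_hash(img)

def find_duplicate_images(directory):
    # Collect image files by extension; sorting makes "first" and "second" predictable
    images = sorted(
        f for f in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, f)) and f.lower().endswith(IMAGE_EXTENSIONS)
    )
    hashes = {f: image_hash(os.path.join(directory, f)) for f in images}
    duplicates = []
    for img1, img2 in combinations(images, 2):
        # Subtracting two hashes gives the number of differing bits;
        # a difference below the threshold counts as similar
        if hashes[img1] - hashes[img2] < 5:
            duplicates.append((img1, img2))
    return duplicates

def remove_duplicates(directory, duplicates):
    removed = set()
    for img1, img2 in duplicates:
        # Keep the first file of each pair and delete the second.
        # Skip pairs that involve a file already deleted, so that
        # os.remove is never called twice on the same file.
        if img1 in removed or img2 in removed:
            continue
        os.remove(os.path.join(directory, img2))
        removed.add(img2)
        print(f"Removed duplicate: {img2}")

# Usage example
directory = 'path/to/your/image/directory'  # Replace with your image directory
duplicates = find_duplicate_images(directory)
remove_duplicates(directory, duplicates)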

Note: This script uses the simple average hash algorithm to compare images. Subtracting two hashes gives the number of bits in which they differ, and the script treats two images as similar when that difference is below a threshold (5 in this case). The accuracy of this method varies with image quality and the complexity of the content, so you may need to adjust the threshold for your own images.
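To make the threshold more concrete, here is a simplified sketch of the idea behind an average hash, written in plain Python. It is an illustration only, not the imagehash implementation: imagehash first shrinks the image to a small grayscale grid (8x8 by default), while this toy version works directly on hypothetical 4x4 grids of brightness values. The function names average_hash_bits and hamming_distance are made up for this example.

```python
def average_hash_bits(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 4x4 grayscale images (0 = black, 255 = white).
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# The same picture, slightly brighter, with one pixel changed.
brighter = [[min(value + 20, 255) for value in row] for row in original]
brighter[0][0] = 250
# A different picture: top half bright, bottom half dark.
different = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

near = hamming_distance(average_hash_bits(original), average_hash_bits(brighter))
far = hamming_distance(average_hash_bits(original), average_hash_bits(different))
print(near)  # 1: almost the same picture
print(far)   # 8: a clearly different picture
```

The near-duplicate differs in a single bit, while the unrelated picture differs in half of its 16 bits. A real 64-bit hash behaves the same way on a larger scale, which is why a small threshold such as 5 separates touched-up copies from different pictures reasonably well.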

In addition, for each pair of similar images the script deletes the second image and keeps the first. If you want more complex logic (e.g., deleting only exactly identical images, or choosing which image to keep based on a certain strategy), you'll need to extend the remove_duplicates function.
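If you only want to treat byte-for-byte identical files as duplicates, one possible approach is to swap the perceptual hash for a cryptographic digest of the file contents. The sketch below uses only the standard library; file_digest and find_exact_duplicates are hypothetical helper names, not part of the script above, and unlike that script it does not filter by file extension.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(file_path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    sha = hashlib.sha256()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def find_exact_duplicates(directory):
    """Group byte-identical files and return each group of two or more names."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            groups[file_digest(path)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists the names in sorted order, so a simple strategy is to keep the first name in each group and delete the rest. A resized or re-saved copy has different bytes and will not be caught this way, which is the trade-off against the perceptual hash.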

Finally, this script has no backup or undo option: deleted files are gone for good. Make sure to back up your image files before running it to prevent accidental deletion of important files.
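A cautious alternative is to move the duplicates into a separate folder instead of deleting them, so that nothing is lost until you have checked the result. The following is a minimal sketch under that assumption; quarantine_duplicates and backup_dir are hypothetical names, and the duplicates argument is expected to be a list of (first, second) file name pairs like the one returned by find_duplicate_images.

```python
import os
import shutil


def quarantine_duplicates(directory, duplicates, backup_dir):
    """Move the second file of each pair into backup_dir instead of deleting it."""
    os.makedirs(backup_dir, exist_ok=True)
    moved = set()
    for img1, img2 in duplicates:
        # Skip pairs that involve a file that has already been moved.
        if img1 in moved or img2 in moved:
            continue
        shutil.move(os.path.join(directory, img2), os.path.join(backup_dir, img2))
        moved.add(img2)
    return sorted(moved)
```

Once you are satisfied that only true duplicates were moved, you can delete the backup folder by hand. To undo the operation, move the files back into the original directory.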

Code in action

I have a batch of images in which every image has two backup copies. Running the code above identifies the duplicate images.

After the duplicates are deleted, only one original copy of each image remains.
