Finding image similarity across thousands of folders in Python


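I have pieced together/written some code (thanks, Stack Overflow users!) that uses imagehash to check images for similarity, but I am now running into problems when checking thousands of images (roughly 16,000). Is there anything I can improve in the code (or a different route entirely) that would find matches more accurately and/or reduce the time required? Thanks!

I first changed the list I was building to itertools combinations, so that it only compares unique combinations of images.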

new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []

for f1, f2 in itertools.combinations(dirloc,2):
    #Honestly not sure which hash method to use, so I went with dhash.
    dhash1 = imagehash.dhash(Image.open(f1))
    dhash2 = imagehash.dhash(Image.open(f2))
    hashdif = dhash1 - dhash2


    if hashdif < 5:  #May change the 5 to find more accurate matches
            print("images are similar due to dhash", "image1", f1, "image2", f2)
            duplicates.append(f1)
            dup.append(f2)

    #Setting up a CSV file with the similar images to review before deleting
    with open("duplicates.csv", "w") as myfile:
        wr = csv.writer(myfile)
        wr.writerows(zip(duplicates, dup))
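
For reference, subtracting two imagehash hashes yields the number of bit positions in which they differ (the Hamming distance), so `hashdif < 5` means "fewer than 5 differing bits" out of the 64 bits that dhash produces at its default hash size. The sketch below illustrates that comparison with plain integers; the helper `hamming_distance` and the three hash values are made up for illustration and are not part of imagehash.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integer-encoded hashes differ."""
    return bin(a ^ b).count("1")


# Made-up 64-bit values standing in for dhash results (illustrative only).
hash_a = 0xF0F0F0F0F0F0F0F0
hash_b = 0xF0F0F0F0F0F0F0F3  # differs from hash_a in the two lowest bits
hash_c = 0x0F0F0F0F0F0F0F0F  # every bit flipped relative to hash_a

threshold = 5  # same cutoff as the `hashdif < 5` test in the question

# A small distance counts as "similar"; a large one does not.
print(hamming_distance(hash_a, hash_b) < threshold)
print(hamming_distance(hash_a, hash_c) < threshold)
```

Lowering the threshold makes matching stricter (fewer false positives, more missed near-duplicates); raising it does the opposite.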


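As it stands, this code could take days to get through the number of images in the folder. If possible, I would like to get that down to a few hours.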

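Try this: instead of hashing the images every time they are compared (127,992,000 comparisons, each hashing both images again), hash them ahead of time and compare the stored hashes, since the hashes don't change (16,000 hashes).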

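import csv
import itertools
import os

import imagehash
from PIL import Image

new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []

hashes = []

# Hash every image exactly once and keep (filename, hash) pairs.
for file in dirloc:
    hashes.append((file, imagehash.dhash(Image.open(file))))

for pair1, pair2 in itertools.combinations(hashes, 2):
    f1, dhash1 = pair1
    f2, dhash2 = pair2
    #Honestly not sure which hash method to use, so I went with dhash.
    hashdif = dhash1 - dhash2

    if hashdif < 5:  #May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

#Setting up a CSV file with the similar images to review before deleting
with open("duplicates.csv", "w") as myfile:  # also moved out of the loop so the file isn't rewritten every time
    wr = csv.writer(myfile)
    wr.writerows(zip(duplicates, dup))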
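
As a quick check of the counts behind this answer, the arithmetic can be reproduced with the standard library. This is only a sketch: 16,000 is the approximate image count from the question, the variable names are illustrative, and `math.comb` needs Python 3.8 or later.

```python
import math

n_images = 16000  # approximate count given in the question

# Unique unordered pairs, as produced by itertools.combinations(files, 2).
n_pairs = math.comb(n_images, 2)

# Original loop: two dhash calls (and two Image.open calls) for every pair.
hash_calls_in_loop = 2 * n_pairs

# Precomputed version: one dhash call per image; the pairwise comparisons
# remain, but each one is only a cheap hash subtraction.
hash_calls_precomputed = n_images

print(n_pairs)
print(hash_calls_in_loop)
print(hash_calls_precomputed)
```

The number of comparisons is unchanged either way; what shrinks is the number of times an image file is opened and hashed, which is by far the more expensive step.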

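Just as a tip, the line with open("duplicates.csv", "w") as myfile: will overwrite the duplicates.csv file on every iteration, so you should move it out of the loop.

Good catch! Thanks.

What kind of images are these, content-wise? Graphics, line drawings, or real-world images? Color or grayscale? How big are the files?

It looks like you are recomputing the hash of each image for every other image it is compared against. You could compute each image's hash just once and store it in a database or in the image metadata, then use the stored hash values for the comparisons. That way no image is hashed more than once. Other hashes are available too, but they may be slower in exchange for better-quality matches.

This looks perfect and cuts down the number of computations needed, but I am getting an error: image = image.convert("L").resize((hash_size + 1, hash_size), Image.ANTIALIAS) AttributeError: 'str' object has no attribute 'convert'

Which line of code raises this error? The imagehash.dhash(Image.open(file)) line?

Yes, that line: hashes.append((file, imagehash.dhash(file)))

Does hashing the image the way the example you provided does work?

I tested it on a smaller folder of images to verify that it finds the duplicates. It isn't as accurate as I had hoped, but it's a start.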
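
The suggestion in the comments to store each hash once and reuse it could be sketched as a small on-disk cache. This is a minimal sketch under stated assumptions, not code from the thread: the function name `load_or_compute_hashes`, the `compute_hash` callback, and the file name `hash_cache.json` are all hypothetical, and hashes are stored as plain strings so the example runs with the standard library alone.

```python
import json
import os
from typing import Callable, Dict, Iterable


def load_or_compute_hashes(
    paths: Iterable[str],
    compute_hash: Callable[[str], str],
    cache_path: str = "hash_cache.json",  # hypothetical cache file name
) -> Dict[str, str]:
    """Return {path: hash string}, hashing only paths missing from the cache."""
    wanted = list(paths)

    # Load hashes saved by an earlier run, if there are any.
    cache: Dict[str, str] = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            cache = json.load(fh)

    # Compute a hash only for files that have not been seen before.
    for path in wanted:
        if path not in cache:
            cache[path] = compute_hash(path)

    # Save the updated cache so the next run can skip these files.
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)

    return {path: cache[path] for path in wanted}
```

With imagehash, `compute_hash` could be something like `lambda p: str(imagehash.dhash(Image.open(p)))`, and `imagehash.hex_to_hash` can turn a stored string back into a hash object for subtraction. Note that this sketch does not detect files that changed after they were cached; keying on modification time as well would be one way to handle that.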