Python: is there a better algorithm for comparing each image with every other image in a folder?


python python-3.x list algorithm image-processing


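I am using Python 3.8.1 to build tools for organizing images. One of these tools detects and separates similar images. For that I need an algorithm that compares each image with all the other images in the folder; more precisely, I need a better algorithm than the one I have now.

My approach to the problem is as follows:
Let's take 6 images as an example, named 1, 2, 3, 4, 5 and 6. These are all the possible comparisons between each image and every image in the folder:
1 -> 1 2 -> 1 3 -> 1 4 -> 1 5 -> 1 6 -> 1
1 -> 2 2 -> 2 3 -> 2 4 -> 2 5 -> 2 6 -> 2
1 -> 3 2 -> 3 3 -> 3 4 -> 3 5 -> 3 6 -> 3
1 -> 4 2 -> 4 3 -> 4 4 -> 4 5 -> 4 6 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5 5 -> 5 6 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6 6 -> 6
That is 6*6=36 comparisons.
Next, since we compare the images to find similar ones, it makes sense to exclude comparing an image with itself, so we drop 1 -> 1, 2 -> 2, and so on. We also need to avoid comparing the same two images twice, for example comparing 1 -> 2 and then 2 -> 1 again. Logically, comparing "me" with "you" is no different from comparing "you" with "me". If "me" and "you" turn out to be different, there is no need to compare them again.
So the remaining comparisons are:
1 -> 2
1 -> 3 2 -> 3
1 -> 4 2 -> 4 3 -> 4
1 -> 5 2 -> 5 3 -> 5 4 -> 5
1 -> 6 2 -> 6 3 -> 6 4 -> 6 5 -> 6
This reduces the total number of comparisons to 5+4+3+2+1=15, fewer than half of the original 36.
I implement this with a method that takes the list of all images in the folder and returns a list of image pairs following the logic above. Here is the method: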

def get_cmpr_pairs_list( self ):
    cmprPairsList = []
    for i in range(0, self.imgCount):
        for j in range(i+1, self.imgCount):
            cmprPairsList.append( [self.filenames[i], self.filenames[j]] )
    return cmprPairsList
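
The nested loop above builds the same pairs that `itertools.combinations` from the standard library produces. A minimal sketch, assuming a plain list of filenames instead of the class attributes (the function name `get_comparison_pairs` is made up for illustration):

```python
from itertools import combinations

def get_comparison_pairs(filenames):
    """Return every unordered pair of distinct filenames exactly once."""
    return list(combinations(filenames, 2))

# The six example images from the question.
pairs = get_comparison_pairs(["1", "2", "3", "4", "5", "6"])
print(len(pairs))  # 15
```

For n images this yields n*(n-1)/2 pairs, which matches the 5+4+3+2+1=15 count worked out by hand.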

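Thinking that this was still not enough, I use the multiprocessing module to split the work of comparing these image pairs across all CPU cores (an 8-core CPU). This is the method I wrote to compare all the images in the folder: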

def find_similars_all( self ):
    print("Finding similar images...\n")

    similarImg = []         # a list that stores similar images
    cmprPairsList = self.get_cmpr_pairs_list()       # get the comparison pairs list
    self.cmprPairsCount = len(cmprPairsList)

    # create a multiprocess pool
    with multiprocessing.get_context("spawn").Pool() as p :
        print("Total images to compare : {} images\n".format(self.cmprPairsCount))
        # find all similar image and add it to similarImg
        similarImg = p.map(self.compare_img, cmprPairsList)
        # remove all None from the list
        similarImg = list(filter(None, similarImg))
        # melt all the list result into a single list
        # from [[], [], ...] into [ ... ]
        similarImg = itertools.chain.from_iterable(similarImg)
        # remove duplicates
        similarImg = list(dict.fromkeys(similarImg))

    # if we have found some similar images...
    if similarImg:
        self.move_filenames(similarImg, "Similars")
    else :
        print("\nNo similar images found.")

    print("\nDone searching for similar images!")
    self.update_filenames()

Finally, this is the method that each process runs to compare one pair of images. If the two images are similar, it returns the image pair:

def compare_img( self, imagePair ):
    # get the pair of image filename that we want to compare
    imgA, imgB = imagePair[0], imagePair[1]
    imgA = os.path.join(self.dirname, imgA)
    imgB = os.path.join(self.dirname, imgB)
    
    # ------ just making a counter for the image comparisons ------
    global counter
    counter.increment()
    print("{}/{} images compared...".format(counter.value(), self.cmprPairsCount), end="\r")
    # ------ just making a counter for the image comparisons ------

    # ---- comparison algorithm starts here ----
    # add the threshold
    threshold = 1 - self.similarity_percentage/100
    diff_limit = int(threshold*(self.hash_size**2))
    
    # create the average hash of the image
    with Image.open(imgA) as img:
        hash1 = imagehash.average_hash(img, self.hash_size).hash
    
    with Image.open(imgB) as img:
        hash2 = imagehash.average_hash(img, self.hash_size).hash
    
    result = np.count_nonzero(hash1 != hash2) <= diff_limit
    # ---- comparison algorithm stops here ----

    # this part will conclude whether the two image is similar or not
    if result:
        print("{} image found {}% similar to {}".format(imgA, self.similarity_percentage, imgB))
        return imagePair
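
To make the threshold arithmetic in `compare_img` easier to follow, here is a small sketch in plain Python. It assumes each hash is a flat sequence of `hash_size**2` bits. The helper names `diff_limit` and `hamming_distance` and the example bit patterns are invented for illustration and are not part of the original code.

```python
def diff_limit(similarity_percentage, hash_size):
    """Maximum number of differing hash bits still counted as 'similar'."""
    threshold = 1 - similarity_percentage / 100
    return int(threshold * hash_size ** 2)

def hamming_distance(bits_a, bits_b):
    """Count the positions where two equal-length bit sequences differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))

# Hypothetical 8x8 hashes (64 bits each) that differ in 10 positions.
hash_a = [0] * 64
hash_b = [1] * 10 + [0] * 54

limit = diff_limit(75, 8)            # 25% of 64 bits may differ
distance = hamming_distance(hash_a, hash_b)
is_similar = distance <= limit
print(limit, distance, is_similar)   # 16 10 True
```

In the original method the same count is done with `np.count_nonzero(hash1 != hash2)` on the boolean arrays stored in the `.hash` attribute.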

Don't compute the hashes for every comparison. Open each file, compute its hash, and store it. Then compare the stored hashes for all combinations.
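
The idea of hashing each file once and then comparing only the stored hashes could look roughly like the sketch below. It is not the asker's final code. `find_similar_pairs`, `hash_fn` and `max_diff` are hypothetical names, and `hash_fn` stands in for whatever opens an image and returns its hash bits (in the question that would presumably wrap `imagehash.average_hash`). A fake hash function is used here so the example runs without any image files.

```python
from itertools import combinations

def find_similar_pairs(filenames, hash_fn, max_diff):
    """Hash each file once, then compare the stored hashes pairwise."""
    # N hash computations in total, instead of two per compared pair.
    hashes = {name: hash_fn(name) for name in filenames}
    similar = []
    for a, b in combinations(filenames, 2):
        # Number of bit positions where the two stored hashes differ.
        distance = sum(x != y for x, y in zip(hashes[a], hashes[b]))
        if distance <= max_diff:
            similar.append((a, b))
    return similar

# Stand-in hashes (assumption: 4-bit tuples instead of real image hashes).
fake_hashes = {
    "a.png": (0, 0, 0, 0),
    "b.png": (0, 0, 0, 1),
    "c.png": (1, 1, 1, 1),
}
calls = []

def fake_hash(name):
    calls.append(name)  # record how often each file gets hashed
    return fake_hashes[name]

result = find_similar_pairs(list(fake_hashes), fake_hash, max_diff=1)
print(result)      # [('a.png', 'b.png')]
print(len(calls))  # 3
```

With N images, the per-pair approach opens and hashes files N*(N-1) times, while this version does it N times. The pairwise loop that remains only compares small bit sequences already held in memory.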
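Can you tell me more about what this line does: `hash1 = imagehash.average_hash(img, self.hash_size).hash`? Why don't you compute all the hashes first and then check the number of differing hashes?

@Surt - As for the hash or the comparison algorithm, I copied it from a website, so I don't know how it decides whether the images are similar or not. But I remember that it works by taking the average hash of the images and then XORing those values (I don't remember exactly).

@Abhinav Mathur, I think I can try hashing all the images, storing those hashes in the image pair list, and then letting the CPU compute the difference between each pair of hashes.

You won't need the comparisons, since a dictionary will store only one image id for each computed hash.

Yes, I think I can try that. Compute all the hashes, store them in pairs, then let the CPU process all those hash pairs. I'll let you know the result, thanks for the suggestion!

For the same set of images, the operation went from 1 hour to an instant. I never expected it to be this fast, haha.

You're welcome. This is the first thing to do when optimizing: look for work that is being done repeatedly and think about how to do it only once.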
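
The dictionary idea mentioned in the comments can be sketched as follows. Note the limitation: grouping by hash only finds images whose hashes are exactly equal, so it corresponds to a 100% similarity setting and does not replace the threshold comparison for near matches. `group_by_hash` and the fake hash strings are assumptions for illustration. With real images the key could be, for example, the hex string given by `str()` on an `imagehash` hash object.

```python
from collections import defaultdict

def group_by_hash(filenames, hash_fn):
    """Group filenames whose hashes are identical; no pairwise comparison."""
    groups = defaultdict(list)
    for name in filenames:
        # The hash must be hashable (e.g. a string or tuple) to act as a key.
        groups[hash_fn(name)].append(name)
    # Keep only hashes shared by at least two files.
    return [names for names in groups.values() if len(names) > 1]

# Stand-in hash strings (assumption: not computed from real images).
fake_hex = {"a.png": "ff00", "b.png": "ff00", "c.png": "0f0f"}

duplicates = group_by_hash(list(fake_hex), fake_hex.get)
print(duplicates)  # [['a.png', 'b.png']]
```

This runs in time proportional to the number of images, since each file is hashed and inserted into the dictionary once.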