MD5 returns different hash codes - Python


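I am trying to verify the data consistency of some files. However, the MD5 hashes keep coming out different, even though the hashes are equal when I run md5sum: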

import hashlib
import os
import sys

def hash_file_content(path):
    try:
        if not os.path.exists(path):
            raise IOError, "File does not exist"
        encode = hashlib.md5(path).hexdigest()
        return encode
    except Exception, e:
        print e

def main():
    hash1 = hash_file_content("./downloads/sample_file_1")
    hash2 = hash_file_content("./samples/sample_file_1")

    print hash1, hash2

if __name__ == "__main__":
   main()

The output is unexpectedly different:

baed6a40f91ee5c44488ecd9a2c6589e 490052e9b1d3994827f4c7859dc127f0

Now using md5sum:

md5sum ./samples/sample_file_1
9655c36a5fdf546f142ffc8b1b9b0d93  ./samples/sample_file_1

md5sum ./downloads/sample_file_1 
9655c36a5fdf546f142ffc8b1b9b0d93  ./downloads/sample_file_1

Why does this happen, and how can I fix it?

In your code, you are computing the md5 of the file path, not of the file's contents:

...
encode = hashlib.md5(path).hexdigest()
...
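To make the cause concrete, here is a small illustration. The helper name hash_of_path_string is hypothetical, and the paths are the ones from the question. It hashes only the path string, so the result is determined by the text of the path and the file itself is never opened:

```python
import hashlib

def hash_of_path_string(path):
    """Hash the characters of the path itself (this mirrors the bug)."""
    # No file is opened here: only the path text is fed to MD5.
    return hashlib.md5(path.encode("utf-8")).hexdigest()

for p in ("./downloads/sample_file_1", "./samples/sample_file_1"):
    print(p, hash_of_path_string(p))
```

Two different path strings give two different digests, regardless of what the files contain.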

Instead, compute the md5 of the file's contents:

# Open in binary mode so the bytes hashed are exactly the bytes md5sum reads
with open(path, "rb") as f:
    encode = hashlib.md5(f.read()).hexdigest()

This will give you matching output (that is, the two hashes match each other and are the same as the output of md5sum).

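Since the files are large, reading everything with a single f.read() would be very heavy, and it will not work at all when the file size exceeds the available memory.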

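Instead, take advantage of the fact that md5 can compute the hash over chunks through its update method: define a function that uses md5.update and call it from your code, as described in the answer at stackoverflow.com/a/1131255/1860929.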
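A minimal sketch of such a function could look like the following. The name md5_for_file is taken from the call used in this answer; the block_size parameter and its default of 8192 bytes are assumptions chosen for illustration, and this is not necessarily the exact code from the linked answer:

```python
import hashlib

def md5_for_file(path, block_size=8192):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    md5 = hashlib.md5()
    # Binary mode: hash the raw bytes, as md5sum does.
    with open(path, "rb") as f:
        # f.read(block_size) returns b"" at end of file, which stops the loop.
        for chunk in iter(lambda: f.read(block_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Feeding the chunks to update one after another gives the same digest as hashing the whole content at once, so only one block needs to be in memory at a time.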

Now call it in your code:

encode = md5_for_file(path)

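The files are large, some of them gigabytes in size... Is f.read() the right method to use, or should I read only a portion of the file, say 1024 bytes?

@philippe f.read() would be too heavy. Use the method mentioned here -> stackoverflow.com/a/1131255/1860929; I am editing it into my answer.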
