Script works with Python 2 but not Python 3 (hashlib)

utf-8 python-3.x md5 python-2.x hashlib

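Today I was working on a simple script that checksums files with any of the available hashlib algorithms (md5, sha1, ...). I wrote and debugged it with Python 2, but when I decided to port it to Python 3 it stopped working. The funny thing is that it works for small files but not for big ones. I thought there was a problem with the way I was buffering the file, but the error message makes me think it is related to the way I am doing the hexdigest (I think). Here is a copy of my entire script, so feel free to copy it, use it, and help me figure out what the problem is. The error I get when checksumming a 250 MB file is

"'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte"

I googled it but could not find anything that fixes it. Also, if you see better ways to optimize the script, please let me know. My main goal is to make it work 100% in Python 3. Thanks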

#!/usr/local/bin/python33
import hashlib
import argparse

def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    algorithmType = getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks   
    for path in filepaths:
        try:
            with open(path) as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk.encode())
                yield algorithmType.hexdigest()
        except Exception as e:
            print (e)

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5", 
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    algo = arguments.algorithm
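    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print(hashValue)
    else:
        print("Algorithm {0} is not available in this script".format(algo))

if __name__ == "__main__":
    main()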

Here is the code that works in Python 2. I am including it in case you want to use it without having to modify the code above.

#!/usr/bin/python
import hashlib
import argparse

def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    '''
    Hashes a file. In order to reduce the amount of memory used by the script, it hashes the file in chunks instead of putting
    the whole file in memory
    ''' 
    algorithmType = hashlib.new(algorithm)  #getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks   
    for path in filepaths:
        try:
            with open(path, mode = 'rb') as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk)
                yield algorithmType.hexdigest()
        except Exception as e:
            print e

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5", 
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    #Call generator function to yield hash value
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print hashValue
    else:
        print "Algorithm {0} is not available in this script".format(algo)

if __name__ == "__main__":
    main()

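I haven't tried it in Python 3, but in Python 2.7.5 I get the same error for binary files (the only difference is that mine comes from the ascii codec). Instead of encoding the data chunks, open the file directly in binary mode: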

with open(path, 'rb') as f:
    while True:
        dataChunk = f.read(blockSize)
        if not dataChunk:
            break
        algorithmType.update(dataChunk)
    yield algorithmType.hexdigest()
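
To illustrate why text mode can fail on binary data, here is a minimal sketch. The sample bytes are made up for the illustration, and it assumes the file was being decoded as UTF-8, which is what the error message in the question suggests. Files that happen to contain only valid UTF-8 would not trigger the error, which may be why the small files worked.

```python
import hashlib

# Hypothetical sample data: 0xf3 starts a 4-byte UTF-8 sequence, but the
# next byte "(" (0x28) is not a valid continuation byte.
raw = b"0123456789\xf3(rest of the file"

decode_error = None
try:
    # Roughly what reading the file in text mode has to do with each chunk.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)

# hashlib works on bytes, so no decoding or encoding step is needed at all.
digest = hashlib.md5(raw).hexdigest()
print(digest)
```

Reading with 'rb' skips the decoding step entirely, so the bytes reach update() unchanged.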

Other than that, I would use hashlib.new instead of getattr, and hashlib.algorithms_available to check whether the argument is valid.
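
Putting those suggestions together, a Python 3 version could look roughly like the sketch below. The name hash_files and the temporary test file are my own, not part of the original script. Creating a fresh hash object for each file is an assumption about the intended behaviour: the scripts above reuse one object, so digests would accumulate across several files.

```python
import hashlib
import os
import tempfile


def hash_files(filepaths, algorithm="md5", block_size=4096):
    """Yield (path, hex digest) for each file, reading it in binary chunks."""
    # algorithms_available is the set of names this interpreter supports.
    if algorithm not in hashlib.algorithms_available:
        raise ValueError("Algorithm {0} is not available".format(algorithm))
    for path in filepaths:
        # Fresh hash object per file (assumed intent, see the note above).
        hasher = hashlib.new(algorithm)
        with open(path, "rb") as f:  # binary mode: bytes in, no decoding
            # iter() with a sentinel stops when read() returns b"" at EOF.
            for chunk in iter(lambda: f.read(block_size), b""):
                hasher.update(chunk)
        yield path, hasher.hexdigest()


# Usage: hash a temporary file whose contents are not valid UTF-8.
data = b"\xf3\x28" * 5000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    results = list(hash_files([tmp.name], "sha1"))
    print(results[0][1])
finally:
    os.remove(tmp.name)
```

Note that the shake_* algorithms need a length argument for hexdigest(), so this sketch would fail on those names.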

Thanks. I will avoid hashlib.algorithms_available because it is only available from 3.2 onward and not in Python 2, but hashlib.new does look cleaner. By the way, I got the error you mentioned, but I solved it by removing the .encode() when updating with dataChunk. The second script I added should work for you.

@JuanCarlos It doesn't give any error, but it returns an incorrect md5sum unless you open the file in binary mode.

That's odd, on my system it works either way. I checked 10 files with the script and with the md5sum tool on Linux and got exactly the same results, and the same goes for sha1. Anyway, it works on my system, but to be safe I will open the file in binary mode. Thanks
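
On the disagreement in the comments: in Python 2 on Linux, text mode and binary mode read the same bytes, which may be why the results matched md5sum there. In Python 3, text mode decodes the data and, by default, translates line endings, so the digest can change even when no error is raised. A small sketch with made-up file contents:

```python
import hashlib
import os
import tempfile

payload = b"line one\r\nline two\r\n"  # hypothetical file contents

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    # Binary mode: the bytes on disk are hashed unchanged.
    with open(tmp.name, "rb") as f:
        binary_digest = hashlib.md5(f.read()).hexdigest()
    # Text mode: the data is decoded and "\r\n" is translated to "\n",
    # so re-encoding does not give back the original bytes.
    with open(tmp.name, encoding="utf-8") as f:
        text_digest = hashlib.md5(f.read().encode("utf-8")).hexdigest()
finally:
    os.remove(tmp.name)

print(binary_digest == text_digest)
```

Only the binary-mode digest corresponds to the bytes that a tool such as md5sum sees.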