Python: corrupted Hebrew text saved as ANSI, converting back to UTF-8


python utf-8 character-encoding


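I suspect that some data (on a Windows machine) was saved as ANSI. As a result, the original Hebrew text no longer displays correctly, and what we now see is garbled text instead.

Given that the original text is known to be Hebrew, is the information lost, or is it possible to map the characters back?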
The information is probably not lost, or at most partially lost. If you want to use Python:
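BLOCKSIZE = 1048576  # or some other desired chunk size (characters per read in text mode)

# Read the file as windows-1255 (the Hebrew "ANSI" code page) and write it back as UTF-8.
# newline="" keeps the original line endings untouched.
with open("input.txt", "r", encoding="windows-1255", newline="") as source_file:
    with open("output.txt", "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
                break  # end of file reached
            target_file.write(contents)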
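To get a feel for whether the loss is "none" or "partial", one option is to decode the raw bytes with `errors="replace"` and count the replacement characters. The following is only a minimal sketch: the function name `decode_hebrew_with_report` and the sample bytes are made up for illustration, and it assumes the data really is windows-1255.

```python
def decode_hebrew_with_report(raw: bytes, encoding: str = "windows-1255"):
    """Decode raw bytes and return (text, number_of_undecodable_bytes).

    Hypothetical helper: assumes the bytes were written in `encoding`.
    """
    # errors="replace" substitutes U+FFFD for every byte the code page does not define.
    text = raw.decode(encoding, errors="replace")
    return text, text.count("\ufffd")


# Hypothetical sample: four Hebrew letters (shin, lamed, vav, final mem) as windows-1255 bytes.
sample = bytes([0xF9, 0xEC, 0xE5, 0xED])
text, lost = decode_hebrew_with_report(sample)
```

A count of zero suggests every byte mapped to a character; a non-zero count points at bytes that windows-1255 cannot represent, which may mean a different code page was used.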

(Borrowed and adapted from an existing snippet.)
You can also use an external tool such as iconv:
iconv -f windows-1255 -t utf-8 input.txt > output.txt

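iconv is available on most Linux distributions, Cygwin, and other platforms.
If the file was doubly corrupted (Hebrew bytes misread as windows-1252 and then saved as UTF-8), you may need to do the following: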
iconv -f utf-8 -t windows-1252 input.txt > tmp.txt
iconv -f windows-1255 -t utf-8 tmp.txt > output.txt

But it is unlikely that this happened.
The file is more likely in a dedicated Hebrew code page than in UTF-8.
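For reference, the same two-step repair for the doubly corrupted case can be sketched in Python on an in-memory string. This assumes the damage really was "windows-1255 bytes misread as windows-1252"; the function name `repair_double_encoded` is hypothetical, and the sample is constructed only to simulate that damage.

```python
def repair_double_encoded(text: str) -> str:
    """Undo 'Hebrew bytes decoded as windows-1252' mojibake.

    Hypothetical helper: assumes `text` was produced by reading
    windows-1255 bytes with the windows-1252 code page.
    """
    # Step 1: turn the wrong characters back into the original bytes.
    raw = text.encode("windows-1252")
    # Step 2: interpret those bytes with the Hebrew code page.
    return raw.decode("windows-1255")


# Simulate the corruption on a known Hebrew word, then repair it.
original = "שלום"
mojibake = original.encode("windows-1255").decode("windows-1252")
repaired = repair_double_encoded(mojibake)
```

If step 1 raises `UnicodeEncodeError`, the text contains characters that windows-1252 cannot represent, which is a hint that the file was not corrupted in this particular way.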