Python: convert a binary file to a string while ignoring non-ASCII characters

I have a binary file and I want to extract all the ASCII characters while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

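However, I get an error when writing to the file:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

How do I get Python to ignore non-ASCII characters?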

Use the built-in ASCII codec and tell it to ignore any errors, like this:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()

You can test it and play with it in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Just trying to convert it to a string raises the exception:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)
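
For comparison, here is a minimal sketch of the same idea on Python 3, where str is already Unicode and encode() returns bytes. It assumes Python 3, and the helper name strip_non_ascii is made up for illustration:

```python
def strip_non_ascii(text):
    """Drop every character that the ASCII codec cannot encode."""
    # errors='ignore' makes encode() skip unencodable characters silently;
    # decode() then turns the remaining ASCII bytes back into a str.
    return text.encode('ascii', 'ignore').decode('ascii')


s = 'hello \u00a0 there'
# The no-break space is dropped; the two ordinary spaces around it remain.
print(repr(strip_non_ascii(s)))
```

Control characters such as SOH or ACK survive this step, because they are valid ASCII.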

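Basically, the ASCII table takes values in the range [0, 2^7), i.e. 0 to 127, and associates each of them with a (writable or non-writable) character. So, to ignore non-ASCII characters, you just have to skip every character whose code is not in that range and keep only those whose code is lower than or equal to 127.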

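In Python there is a function called ord which, according to its docstring, will

Return the integer ordinal of a one-character string

In other words, it gives you the code of a character. Now you must ignore every character that, when passed to ord, returns a value greater than 127. This can be done as follows: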

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    out_file.close()
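
The same loop can be sketched for Python 3 with both files managed by with blocks, so nothing has to be closed by hand. This is only a sketch under stated assumptions: Python 3, input that really is utf-16-le, and hypothetical names (extract_ascii, the demo file names) that do not come from the question:

```python
import os
import tempfile


def extract_ascii(src_path, dst_path, encoding='utf-16-le'):
    """Write only the ASCII characters (codes 0..127) of src_path to dst_path."""
    with open(src_path, 'rb') as fobj:
        text = fobj.read().decode(encoding)

    # Keep a character only if its code is below 128.
    ascii_only = ''.join(ch for ch in text if ord(ch) < 128)

    # newline='' writes line endings exactly as they appear in the data.
    with open(dst_path, 'w', encoding='ascii', newline='') as out_file:
        out_file.write(ascii_only)
    return ascii_only


# Demo on a throwaway file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'input.bin')
dst = os.path.join(tmp_dir, 'text.txt')
with open(src, 'wb') as f:
    f.write('caf\u00e9 \u00a0ok\x01'.encode('utf-16-le'))

result = extract_ascii(src, dst)
print(repr(result))
```

Note that the control character \x01 is kept, since its code is below 128.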

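Now, if you only want to keep printable characters, note that all of them (at least in the ASCII table) lie between 32 (space) and 126 (tilde), so you simply have to do: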

if 32 <= ord(character) <= 126:
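
As a small self-contained sketch of that printable-only test (the function name keep_printable and the sample string are made up for illustration):

```python
def keep_printable(text):
    """Keep only printable ASCII: space (32) through tilde (126)."""
    return ''.join(ch for ch in text if 32 <= ord(ch) <= 126)


# SOH (\x01), ACK (\x06), DEL (\x7f) and the no-break space are all dropped.
sample = 'a\x01b\x06c\x7f \u00a0d'
print(repr(keep_printable(sample)))
```

Be aware that this range also drops tabs and newlines (codes 9 and 10); if line breaks should survive, they would have to be allowed explicitly.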

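Are you sure there are no Unicode characters in the file? It looks like your input file is encoded as utf-16-le, so you should specify that encoding when opening the file. In Python 2 that takes a separate function, but in Python 3 the ordinary built-in can do it.

Does Python have a predefined range for what it considers ASCII? The output is still extracting characters such as SOH and ACK (I'm not sure what these are, I'm just typing them the way they show up in Sublime Text).

@VeraWang SOH and ACK are ASCII. The range is 0 to 127, and those two are 1 and 6 respectively.

@VeraWang -- ASCII characters 0..31 are non-printable (including those two; see the chart on the Wikipedia page about ASCII). If this doesn't do what you need, more information about the actual problem you are trying to solve might be useful...

So if I only want the printable ASCII characters [32, 127], is it simply ord(char) < 128 and ord(char) > 31?

@VeraWang Almost (127 is not printable), although a chained comparison starting from 31 is simpler.

@VeraWang That's almost it! You forgot that 127 is the DEL character, which is not printable, so the interval is now the closed interval [32, 126]: ord(character) >= 32 and ord(character) <= 126.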
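
The first comment's suggestion of naming the encoding when the file is opened could look roughly like this on Python 3. It is a sketch, not the commenter's code: it assumes Python 3's built-in open(), and the function and file names are hypothetical:

```python
import os
import tempfile


def convert_to_ascii(src_path, dst_path):
    """Read a utf-16-le text file and write it back out as plain ASCII."""
    # open() decodes utf-16-le while reading; newline='' leaves line endings alone.
    with open(src_path, 'r', encoding='utf-16-le', newline='') as src_file:
        text = src_file.read()

    # errors='ignore' drops every character ASCII cannot represent on write.
    with open(dst_path, 'w', encoding='ascii', errors='ignore', newline='') as dst_file:
        dst_file.write(text)


# Demo on a throwaway file (hypothetical data).
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, 'input.bin')
demo_dst = os.path.join(demo_dir, 'text.txt')
with open(demo_src, 'wb') as f:
    f.write('hi \u00a0there\n'.encode('utf-16-le'))

convert_to_ascii(demo_src, demo_dst)
with open(demo_dst, 'rb') as f:
    converted = f.read()
print(converted)
```

This variant never calls ord at all; the codec does the filtering, so it removes non-ASCII characters but keeps control characters.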