Python: stripping non-letter characters from Unicode text


python unicode


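I'm trying to write a simple Python script that takes a text file as input, removes every non-letter character, and writes the output to another file. Normally I would do this in one of two ways:

- use a regular expression with `re.sub` to replace every non-letter character with an empty string
- check every character of every line and write it to the output only if it is in `string.lowercase`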
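For reference, here is a minimal Python 3 sketch of the two approaches just described, showing how they treat an accented letter. The helper names are hypothetical, and `string.ascii_lowercase` stands in for Python 2's `string.lowercase`.

```python
import re
import string

def strip_with_regex(text):
    """Approach 1: delete everything that is not an ASCII letter."""
    return re.sub(r'[^a-zA-Z]', '', text)

def strip_with_membership(text):
    """Approach 2: keep a character only if it is an ASCII lowercase letter."""
    return ''.join(ch for ch in text if ch in string.ascii_lowercase)

sample = 'perch\u00e9 no?'  # "perché no?"

# Both approaches drop the accented letter along with the space and the '?'.
print(strip_with_regex(sample))
print(strip_with_membership(sample))
```

Both helpers only know about the 26 ASCII letters, which is why they are not enough for Italian text.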

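But this time the text is Dante's Divine Comedy in Italian (I'm Italian), so it contains some Unicode characters such as

èéï

and a few others. I wrote

# -*- coding: utf-8 -*-

as the first line of the script, but all that achieved is that Python no longer raises an error when Unicode characters appear in the script itself.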

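Then I tried including the Unicode characters in my regular expressions, writing them as, for example:

u'\u00AB'

This seems to work, but when Python reads the input from the file it does not write the text back the way it read it. For example, some characters are converted into the square root symbol.

What should I do?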
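One plausible way to end up with square root symbols can be reproduced in a few lines of Python 3: UTF-8 bytes shown through a legacy single-byte encoding. Mac OS Roman is purely an assumption here; the question does not say which encoding the terminal or editor was using.

```python
text = '\u00e8\u00e9\u00ef'  # èéï

# UTF-8 encodes each of these letters as two bytes, the first being 0xC3.
utf8_bytes = text.encode('utf-8')

# Assumption: the bytes are interpreted as Mac OS Roman, where 0xC3 is
# the square root sign U+221A.
garbled = utf8_bytes.decode('mac_roman')
print(ascii(garbled))

# Decoding with the encoding the bytes were written in restores the text.
restored = utf8_bytes.decode('utf-8')
print(restored == text)
```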

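import codecs

# Python 2: codecs.open decodes the bytes, so each line is a unicode object
f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print repr(line)  # 1. escaped representation of the unicode object
    print line        # 2. the text itself
f.close()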

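The first `print` gives you the text in its Unicode (escaped) representation.
The second `print` gives you the text as it is written in your file.

Hope it helps :)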
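In Python 3 the same idea can be sketched with the built-in `open`, which accepts an `encoding` argument for both reading and writing. The function name and the file names below are hypothetical, chosen only for illustration.

```python
import os
import tempfile

def copy_text(src_path, dst_path, encoding='utf-8'):
    """Copy a text file, decoding on read and encoding on write explicitly."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, 'w', encoding=encoding) as dst:
        for line in src:      # each line is already a decoded str
            dst.write(line)

# Usage with throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'input.txt')
dst_path = os.path.join(workdir, 'output.txt')
with open(src_path, 'w', encoding='utf-8') as f:
    f.write('cos\u00ec \u00e8\n')  # "così è"
copy_text(src_path, dst_path)
```

Naming the encoding on both sides avoids depending on the platform default, which is the usual source of text that comes back different from what was read.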

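unicodedata.category(unichr) will return the category of that code point.

Descriptions of the categories are part of the Unicode documentation, but the ones relevant to you are the L (letter), N (number), P (punctuation), Z (separator) and S (symbol) groups.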
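As a quick illustration, the following Python 3 sketch prints the category of a few sample characters. The sample list is arbitrary; the first letter of each two-letter category is the group it belongs to.

```python
import unicodedata

# A few characters of the kind found in Italian text.
samples = ['a', '\u00e8', 'Z', '7', '\u00ab', ',', ' ', '+']

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch in samples:
    # ascii() keeps the output printable on any terminal.
    print(ascii(ch), categories[ch], 'group', categories[ch][0])
```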

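You may also want to normalize your string first, so that diacritics that can be attached to letters are attached:

unicodedata.normalize(form, unistr)

Returns the normal form `form` of the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.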
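A small Python 3 sketch of why normalization matters for category filtering: in decomposed form the accent is a separate combining mark (category Mn), so a filter that keeps only letters would silently turn the accented letter into a bare `e`.

```python
import unicodedata

decomposed = 'e\u0300'  # 'e' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize('NFC', decomposed)

# Two code points before normalization, one after.
print(len(decomposed), len(composed))

# The combining accent has its own category until it is composed.
print([unicodedata.category(ch) for ch in decomposed])
print([unicodedata.category(ch) for ch in composed])
```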

Putting it all together:

import unicodedata

file_bytes = ...   # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Decimal digits and letter numbers
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs'                           # Space separators
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
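For the stricter goal in the question (letters only), a file-to-file variant might look like the Python 3 sketch below. The function names, the `keep_whitespace` option, and the choice to keep whitespace by default are assumptions made for illustration, not part of the answer above.

```python
import unicodedata

def keep_letters(text, keep_whitespace=True):
    """Return text with every character outside the L* categories removed."""
    normalized = unicodedata.normalize('NFC', text)
    kept = []
    for ch in normalized:
        if unicodedata.category(ch).startswith('L'):
            kept.append(ch)              # any kind of letter
        elif keep_whitespace and ch.isspace():
            kept.append(ch)              # spaces and line breaks, if wanted
    return ''.join(kept)

def filter_file(src_path, dst_path, encoding='utf-8'):
    """Write a letters-only copy of src_path to dst_path."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, 'w', encoding=encoding) as dst:
        for line in src:
            dst.write(keep_letters(line))

# A line with guillemets, a comma, a digit, and a decomposed accented letter.
line = '\u00abNel mezzo del cammin\u00bb, 1 e\u0300 cos\u00ec!\n'
print(ascii(keep_letters(line)))
```

Testing the first letter of the category avoids listing every letter subcategory by hand. Note that line breaks are control characters (category Cc), so a filter based on categories alone also removes them unless whitespace is kept explicitly.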

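It looks like Python is treating the file as ISO 8859-15 (or 8859-1) rather than UTF-8. You need to work out how to set the UTF-8 property on the file handle so that Python knows to treat it that way. I don't know the details, but I'm sure it can be done. (Note that "èéï" maps to 0xC3 0xA8 = U+00E8, 0xC3 0xA9 = U+00E9, 0xC3 0xAF = U+00EF in UTF-8, but if you treat the byte sequence 0xE8, 0xE9, 0xEF as UTF-8, it is not a valid sequence. Each of those bytes would have to be followed by two bytes in the range 0x80..0xBF to be valid UTF-8.) In other words, is your document Unicode?

What do you mean by "non-letter characters"?

@Mike Samuel I mean any digit, symbol, or punctuation character.

I still get a:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)

Perfect solution. I had a quick look at unicodedata before posting, but did not find anything. Thanks a lot.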
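The byte-level claims in these comments, and the `UnicodeEncodeError`, can be checked with a short Python 3 sketch. It only illustrates the encodings involved; it does not reproduce the asker's actual script, which is not shown in the thread.

```python
text = '\u00e8\u00e9\u00ef'  # èéï

# UTF-8 encodes each of these code points as two bytes.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)

# In ISO 8859-1 the same letters are the single bytes 0xE8, 0xE9, 0xEF.
latin1_bytes = text.encode('iso-8859-1')

# Those single bytes are not valid UTF-8: 0xE8 announces a three-byte
# sequence but is not followed by two continuation bytes (0x80..0xBF).
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Encoding U+00AB with the ASCII codec fails, as in the error above;
# an explicit UTF-8 encode succeeds.
try:
    '\u00ab'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
guillemet_bytes = '\u00ab'.encode('utf-8')

print(decoded_ok, ascii_ok, guillemet_bytes)
```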
