Python: How to remove non-alphanumeric characters while keeping Unicode characters and apostrophes (')?

python regex unicode encoding

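I have a text from which I want to remove all non-alphanumeric characters, while keeping Unicode characters and apostrophes, since an apostrophe is part of the word in contractions such as wasn't and can't, in French constructions, and so on. I know I can do

re.sub(ur'\W', '', text, flags=re.UNICODE)

to remove all non-alphanumeric characters, but I don't know how to do the same while keeping apostrophes. Obviously,

re.sub(ur'[^A-Za-z0-9\']', '', text)

doesn't work, because it also strips the Unicode characters. Any ideas?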

You can use a character class shorthand inside a character class:

re.sub(ur"[^\w']+", "", text, flags=re.UNICODE)
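For readers on Python 3, the same idea can be sketched as follows. This is an illustration under a few assumptions, not part of the original answer: the function name strip_punctuation, the keep_spaces option, and the sample string are made up for the example. In Python 3 the ur prefix is no longer valid, and str patterns are Unicode-aware by default, so flags=re.UNICODE is implied.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so re.UNICODE is implied
# and the Python 2 `ur` prefix is neither needed nor valid.
KEEP_WORD_AND_APOSTROPHE = re.compile(r"[^\w']+")
# Hypothetical variant that also keeps whitespace, so words stay separated.
KEEP_WORD_APOSTROPHE_SPACE = re.compile(r"[^\w\s']+")


def strip_punctuation(text: str, keep_spaces: bool = False) -> str:
    """Remove everything except word characters and apostrophes."""
    pattern = KEEP_WORD_APOSTROPHE_SPACE if keep_spaces else KEEP_WORD_AND_APOSTROPHE
    return pattern.sub("", text)


sample = "J'ai mangé, wasn't it?"
print(strip_punctuation(sample))                    # J'aimangéwasn'tit
print(strip_punctuation(sample, keep_spaces=True))  # J'ai mangé wasn't it
```

Note that \w also matches the underscore, so underscores survive this substitution; if that is not wanted, they have to be removed separately.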

Besides using re with re.UNICODE, the predicate functions are Unicode-aware if you are working with a Py2 unicode or a Py3 str. So you can do:

# Py2 (convert text to unicode if it isn't already)
if not isinstance(text, unicode):
    text = text.decode("utf-8")  # Or latin-1; whatever encoding you're implicitly assuming
u''.join(let for let in text if let == u"'" or let.isalnum())

# Py3
''.join(let for let in text if let == "'" or let.isalnum())

This is almost certainly slower than using re, but I thought I would mention it for completeness.
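To check the speed claim on your own machine, a rough timing sketch along these lines may help. It assumes Python 3; the function names, the sample text, and the repetition counts are arbitrary choices for illustration, and the numbers will vary by interpreter and input.

```python
import re
import timeit

PATTERN = re.compile(r"[^\w']+")


def strip_with_re(text: str) -> str:
    # Regex approach: drop runs of characters that are neither \w nor an apostrophe.
    return PATTERN.sub("", text)


def strip_with_isalnum(text: str) -> str:
    # Predicate approach: keep apostrophes and anything str.isalnum() accepts.
    return "".join(ch for ch in text if ch == "'" or ch.isalnum())


sample = "J'ai mangé, wasn't it? " * 1000

# The two agree on this input. They differ on underscores:
# \w matches "_", while "_".isalnum() is False.
assert strip_with_re(sample) == strip_with_isalnum(sample)

re_time = timeit.timeit(lambda: strip_with_re(sample), number=20)
join_time = timeit.timeit(lambda: strip_with_isalnum(sample), number=20)
print(f"re.sub:  {re_time:.4f}s")
print(f"isalnum: {join_time:.4f}s")
```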

How about \w in a negated character class together with the apostrophe? Try

re.sub(ur"[^\w']+", "", text, flags=re.UNICODE)