Python regex with the letters Ø, Æ, Å

python regex encoding python-2.x

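I'm new to Python, so this may look easy. I'm trying to remove every # and every digit, and if the same letter is repeated more than twice in a row I need to reduce it to just two letters. This works well, except for the letters Ø, Æ and Å.

Any idea how to make this work with those letters?

The result I get now is: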

Phone Num : ånd ånd ååååånd dd flløde... :)asd

What I want is:

Phone Num : ånd ånd åånd dd flløde... :)asd

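You need to use Unicode values, not byte strings. A UTF-8 encoded å is two bytes, and a regular expression matching \w only matches ASCII letters, digits and the underscore when running in the default, non-Unicode-aware mode.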
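To make the byte-versus-Unicode difference concrete, here is a minimal sketch. It writes å as the escape \xe5 so the snippet does not depend on the source file encoding, and it is intended to behave the same on Python 2.7 and Python 3. The variable names are only illustrative.

```python
import re

text = u"\xe5nd"                 # u"ånd", with å written as an escape
raw = text.encode("utf-8")       # the same text as a UTF-8 byte string

# å is one Unicode character but two bytes in UTF-8
char_count = len(text)           # 3
byte_count = len(raw)            # 4

# A byte pattern only sees ASCII word characters, so both bytes of å are skipped
byte_matches = re.findall(br"\w", raw)

# On a Unicode string with re.UNICODE, å counts as a word character
unicode_matches = re.findall(r"\w", text, flags=re.UNICODE)

print(byte_matches == [b"n", b"d"])
print(unicode_matches == [u"\xe5", u"n", u"d"])
```

Both comparisons print True: the byte string loses å entirely, while the Unicode string keeps it as a single matchable character.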

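From the re module documentation on \w:

"When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database."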
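The last sentence of the quoted documentation can be checked directly. The sketch below assumes the standard unicodedata module is an acceptable view onto the Unicode character properties database; the helper name is_word_char is made up for this example.

```python
import re
import unicodedata

def is_word_char(ch):
    # True if the single character ch matches \w with re.UNICODE set
    return re.match(r"\w", ch, flags=re.UNICODE) is not None

letters = [u"\xe5", u"\xf8", u"\xe6"]   # å, ø, æ

# 'Ll' is the Unicode general category "Letter, lowercase"
categories = [unicodedata.category(ch) for ch in letters]

# Each of these letters is a word character under re.UNICODE
matches = [is_word_char(ch) for ch in letters]

# Punctuation is not, with or without the flag
dot_is_word = is_word_char(u".")
```

Here categories holds 'Ll' for each of the three letters, which is why \w accepts them once the UNICODE flag is really in effect.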

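Unfortunately, even when you switch to proper Unicode values (using a Unicode u'' literal or by decoding your source data to Unicode values), use a Unicode regular expression (re.sub(ur'...')) and add the re.UNICODE flag to switch \w to matching Unicode alphanumeric characters, Python's re module still has limited support for Unicode matching: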

>>> print re.sub(ur'(\w)\1+', r'\1\1', text, re.UNICODE)
ånd ånd ååååånd dd flløde... :)asd

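That happens because å is not being recognized as alphanumeric here. Note that the signature is re.sub(pattern, repl, string, count=0, flags=0), so re.UNICODE in the fourth position is read as count and the flag is never applied; it has to be passed as flags=re.UNICODE to take effect: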

>>> print re.sub(ur'\w', '', text, re.UNICODE)
å å ååååå  ø... :)
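The difference between the count and flags arguments of re.sub is easy to miss, so here is a small sketch of it. The sample text is modelled on the question's input with å and ø written as escapes, the variable names are illustrative, and the snippet is intended to run on both Python 2.7 and Python 3.

```python
import re

# u"ånd ååååånd flløde", written with escapes
text = u"\xe5nd \xe5\xe5\xe5\xe5\xe5nd fll\xf8de"

# re.UNICODE has the integer value 32; as a fourth positional argument
# it would be taken as count=32, not as a flag
unicode_as_int = int(re.UNICODE)

# Passed by keyword, the flag is applied: \w matches å and the run collapses
collapsed = re.sub(r"(\w)\1+", r"\1\1", text, flags=re.UNICODE)

print(collapsed == u"\xe5nd \xe5\xe5nd fll\xf8de")
```

This prints True: with the flag passed by keyword, the run of five å is reduced to two and the existing ll is left as it is.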

One solution is to use the external regex library, a version of the re library with proper, full Unicode support added:

>>> import regex
>>> print regex.sub(ur'(\w)\1+', r'\1\1', text, flags=regex.UNICODE)
ånd ånd åånd dd flløde... :)asd

That module can do more than just recognize more alphanumeric characters in Unicode values; see the package page for more details.

Change:

text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd "

Complete solution:

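# -*- coding: utf-8 -*-
import re, sys

# Python 2 workaround (generally discouraged): make implicit str/unicode
# conversions use UTF-8 so the print below also works when output is piped.
reload(sys)
sys.setdefaultencoding('utf-8')

text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd "

# Remove '#' characters and digits
text = re.sub(r'#', "", text)
text = re.sub(r"\d", "", text)
# Collapse runs of the same word character to two
text = re.sub(r'(\w)\1+', r'\1\1', text)
# Collapse runs of the same non-word character to two; without re.UNICODE
# this is the pattern that matches å, and it also turns "..." into ".."
text = re.sub(r'(\W)\1+', r'\1\1', text)
print "1: "+ text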

Prints:

1: ånd ånd åånd dd flløde.. :)asd
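A variant of the same steps written as a function, offered as a sketch rather than a drop-in replacement. It passes flags=re.UNICODE so that å is handled by the \w pattern, and it leaves punctuation runs such as ... alone, which differs from the solution above. The function name clean_text and the exact sample string are made up for this example, and the code is intended to run on both Python 2.7 and Python 3.

```python
import re

def clean_text(text):
    # Remove '#' characters and digits
    text = re.sub(r"[#\d]", u"", text, flags=re.UNICODE)
    # Reduce any run of the same word character to two
    return re.sub(r"(\w)\1+", r"\1\1", text, flags=re.UNICODE)

# u"ån9d ånd åååååååånd d9d fllllllløde... :)asd ", written with escapes
sample = (u"\xe5n9d \xe5nd \xe5\xe5\xe5\xe5\xe5\xe5\xe5\xe5nd "
          u"d9d flllllll\xf8de... :)asd ")

cleaned = clean_text(sample)
print(cleaned == u"\xe5nd \xe5nd \xe5\xe5nd dd fll\xf8de... :)asd ")
```

This prints True. Because only \w runs are collapsed, the three dots survive; add a (\W)\1+ substitution if collapsing punctuation is wanted as well.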

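We've talked about this before, haven't we? Use Unicode, not byte strings. As I noted earlier: in Python 2 you would use [Unicode string example] (note the leading u prefix on the string) and [regular expression with re.UNICODE set].

Hi @MartijnPieters, by going through your comments and trying a few things I did find a solution.

That is an option too; note that you are now also changing ... to .., but that may suit your needs.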
