Matching Unicode characters in a Python regular expression

python regex unicode

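I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this has already been answered, but I could not get any of the suggestions there to work.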

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or anything more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

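How can I match typical Unicode characters, like æå? I would like to be able to match those characters in both the tag group and the filename group above.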

You need the re.UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)

You need to specify the re.UNICODE flag, and pass your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

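This is in Python 2. In Python 3 every str is already Unicode, so the u prefix is unnecessary: it was a syntax error in Python 3.0 through 3.2 and has been accepted again, with no effect, since Python 3.3. The re.UNICODE flag is also redundant there, because Unicode matching is the default for str patterns.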
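For comparison, here is a minimal Python 3 sketch of the same match, assuming Python 3.3 or later. The names PATTERN, match_path and ascii_only are illustrative and do not come from the question. The second half passes re.ASCII to reproduce the ASCII-only behaviour of \w that made the original attempt return None.

```python
import re

# Same pattern as in the question. In Python 3, \w already matches
# Unicode word characters for str patterns, so no flag is needed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

def match_path(path):
    """Return the named groups as a dict, or None if the path does not match."""
    m = PATTERN.match(path)
    return m.groupdict() if m else None

print(match_path('/by_tag/påske/øyfjell.jpg'))
# {'tag': 'påske', 'filename': 'øyfjell.jpg'}

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the same path fails to match.
ascii_only = re.compile(PATTERN.pattern, re.ASCII)
print(ascii_only.match('/by_tag/påske/øyfjell.jpg'))
# None
```

Checking for None before calling groupdict() avoids the AttributeError shown in the question when a path does not match.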

In Python 2, you need both the re.UNICODE flag and the unicode() constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+

(In the last case, the comma is a Chinese full-width comma, U+FF0C, which \w does not match.)
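The same substitutions can be sketched in Python 3, where neither the flag nor a unicode() call is needed. The helper name mask_words is hypothetical and only serves this illustration.

```python
import re

def mask_words(text):
    """Replace each run of word characters with three underscores."""
    return re.sub(r"\w+", "___", text)

# Latin, Polish, Cyrillic and Chinese samples from the answer above.
for sample in (",./hello-=+", ",./cześć-=+", ",./привет-=+", ",./你好，世界-=+"):
    print(mask_words(sample))
```

The first three samples collapse to a single `___` between the punctuation. The last one yields two runs, because the full-width comma is punctuation and not a word character.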

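+1 for: "and pass your string as a Unicode string by using the u prefix".

Make sure to normalize the input string, because different code point sequences can produce the same visual appearance.

Is it needed in Python 3 as well?

@Kevin - Python 3 does not need the Unicode flag. "Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns..."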
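The point in the comments about different code point sequences looking identical can be illustrated with a short Python 3 sketch. It assumes NFC is the desired normalization form. The thread does not say which form to use, so treat that choice as an assumption.

```python
import re
import unicodedata

composed = "p\u00e5ske"      # å as a single code point (U+00E5)
decomposed = "pa\u030aske"   # a followed by COMBINING RING ABOVE (U+030A)

# The two strings render the same but are different code point sequences.
print(composed == decomposed)                     # False

# \w does not match the combining mark, so the decomposed form fails.
print(bool(re.fullmatch(r"\w+", composed)))       # True
print(bool(re.fullmatch(r"\w+", decomposed)))     # False

# NFC normalization composes the pair into U+00E5, and the match succeeds.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)                     # True
print(bool(re.fullmatch(r"\w+", normalized)))     # True
```

Normalizing input before matching keeps \w-based patterns from rejecting text that merely uses combining marks.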
