Python re and regex: search() doesn't match the same string with non-ASCII characters

python regex unicode

I'm trying to get re or regex to match a non-ASCII string against itself. I've read other posts about non-ASCII/Unicode handling and tried adding the Unicode flag, to no avail:

# python
Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import regex
>>> s1 = 'wow'
>>> s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'
>>> print(s2)
ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ
>>> re.search(s1,s1)
<_sre.SRE_Match object at 0x7f0ce27c38b8>
>>> re.search(s2,s2)
>>> type(s2)
<type 'str'>
>>> us2 = unicode(s2,'utf-8')
>>> us2
u'\u211b\u212f\u03b1\u2202\u03b1\u266d\u2113\u212f \u266d\u028a\u2602 \u03b7\u263a\u0442 \u0466$\u263e\u2110\u2110'
>>> re.search(us2,us2,re.UNICODE)
>>> regex.search(s2,s2)
>>> regex.search(us2,us2,regex.UNICODE)
>>>

I hope I'm missing something obvious. Any help is much appreciated.

Note that, as a regex pattern, s2 contains an at_end anchor:

In [62]: re.compile(s2, re.DEBUG)
literal 226
literal 132
literal 155
...
at at_end
...
literal 226
literal 132
literal 144

This is because, as a UTF-8 encoded string, s2 is

In [61]: s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'
In [72]: s2
Out[72]: '\xe2\x84\x9b\xe2\x84\xaf\xce\xb1\xe2\x88\x82\xce\xb1\xe2\x99\xad\xe2\x84\x93\xe2\x84\xaf \xe2\x99\xad\xca\x8a\xe2\x98\x82 \xce\xb7\xe2\x98\xba\xd1\x82 \xd1\xa6$\xe2\x98\xbe\xe2\x84\x90\xe2\x84\x90'
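
The literal values in the re.DEBUG listing are simply these UTF-8 bytes. A minimal sketch, assuming Python 3 (where a str literal is Unicode and must be encoded explicitly to obtain the byte string that Python 2 stores by default), shows the correspondence:

```python
# Python 3 sketch: reproduce the byte values seen in the re.DEBUG listing.
s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'

# In Python 2 a plain str literal already holds these UTF-8 bytes.
encoded = s2.encode('utf-8')

first_char_bytes = list('ℛ'.encode('utf-8'))  # bytes of the first character
last_char_bytes = list('ℐ'.encode('utf-8'))   # bytes of the last character

print(first_char_bytes)   # [226, 132, 155]
print(last_char_bytes)    # [226, 132, 144]
print(b'$' in encoded)    # True: '$' stays a single ASCII byte after encoding
```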

Note that s2 contains a $:

In [75]: '$' in s2
Out[75]: True
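
The effect can be reproduced with a much shorter string. The sketch below assumes Python 3 and uses a made-up string, 'a$b', purely for illustration:

```python
import re

text = 'a$b'  # hypothetical short string containing a dollar sign

# Used as a pattern, '$' is an end-of-string anchor, so 'a$b' asks for
# 'a', then the end of the string, then 'b', which can never match.
print(re.search(text, text))               # None

# Without the dollar sign the string matches itself as expected.
print(re.search('ab', 'ab') is not None)   # True
```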

To prevent the $ from being interpreted as an end-of-string anchor, use re.escape to escape all non-alphanumeric characters in the pattern:

In [67]: pat = re.compile(re.escape(s2))

In [68]: pat.search(s2)
Out[68]: <_sre.SRE_Match at 0x7feb6b44dd98>

In [78]: us2 = unicode(s2,'utf-8')

In [79]: re.search(re.escape(us2), us2)
Out[79]: <_sre.SRE_Match at 0x7feb6b44ded0>
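
For reference, here is a minimal sketch of the same fix under Python 3, where str is already Unicode, so neither unicode() nor re.UNICODE is needed. The helper name find_literal is hypothetical, introduced only for this example:

```python
import re

def find_literal(needle, haystack):
    """Search for needle as literal text (hypothetical helper)."""
    # re.escape backslash-escapes regex metacharacters such as '$',
    # so the resulting pattern matches the text itself.
    return re.search(re.escape(needle), haystack)

s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'

print(re.search(s2, s2))       # None: the unescaped '$' acts as an anchor
match = find_literal(s2, s2)
print(match.group(0) == s2)    # True: the whole string matches itself
```

If no regex features are needed at all, a plain substring test such as `s2 in text` sidesteps the problem entirely.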

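My guess is that these characters are being treated as special characters. Maybe you need to escape them?

Do you see the $ in us2? It will certainly prevent a match, because nothing can follow the end of the string.
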
In [81]: u'$' in us2
Out[81]: True