Python re: how to match alphabetic characters


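How do I match an alpha character with a regular expression? I want a character that is in \w but not in \d. I want it to be Unicode compatible, which is why I cannot use [a-zA-Z].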

How about:

\p{L}

You can use this document as a reference:

Edit: take a look at this link: (no longer active, linked to the Internet Archive)
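For what it's worth, the standard-library re module does not understand \p{L}; property escapes of this kind come from Perl-style engines (the third-party regex package on PyPI is one that accepts them). A minimal sketch, assuming Python 3.7 or later, that checks whether a pattern compiles under the stdlib re. The helper name stdlib_re_accepts is made up for illustration:

```python
import re

def stdlib_re_accepts(pattern):
    """Return True if the standard-library re module can compile pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Unicode property escape, as written in Perl-style engines.
print(stdlib_re_accepts(r"\p{L}"))
# Lookahead form built only from classes that re does support.
print(stdlib_re_accepts(r"(?![\d_])\w"))
```

On such a version the first call is expected to report False and the second True, which is why the other answers here build the letter class out of \w instead.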

Other references:

For posterity, here are the examples from the blog:

import re
string = 'riché'
print string
riché

richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)

matched = match.group(1)
print matched
richéñ
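The blog session above is Python 2 working on byte strings. As a rough Python 3 counterpart (a sketch, not the blog author's code), str patterns are Unicode-aware by default, so \w already covers é and ñ, and re.ASCII brings back the truncated match. The variable names are assumptions chosen for illustration:

```python
import re

text = "rich\u00e9\u00f1"  # 'riché' followed by 'ñ', written with escapes

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_word = re.match(r"(\w+)", text).group(1)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops at 'rich'.
ascii_word = re.match(r"(\w+)", text, re.ASCII).group(1)

# Side note on [A-z]: the range also spans the six characters between
# 'Z' and 'a' in ASCII, including the underscore.
a_to_z_matches_underscore = re.fullmatch(r"[A-z]", "_") is not None

print(unicode_word, ascii_word, a_to_z_matches_underscore)
```

Note that \w on its own still lets digits and the underscore through, which is the part the answers below deal with.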

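Your first two sentences contradict each other. "In \w but not in \d" includes the underscore. I'm assuming from your third sentence that you don't want the underscore.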

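Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by \w (i.e. we don't want anything that is not a letter, digit, or underscore) => \W

(2) digits => \d

(3) underscore => _

So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the character class [^\W\d_]

Here's a simple example (Python 2.6)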
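A minimal sketch of that character class in use, written for Python 3, where str patterns are Unicode-aware by default (under Python 2 the pattern would need a unicode string and the re.UNICODE flag). The name rx is chosen to match the session further down; the + quantifier and the sample strings are assumptions made for illustration:

```python
import re

# [^\W\d_]: not a non-word character, not a digit, not an underscore,
# which leaves letters (plus the few numeric oddities discussed below).
rx = re.compile(r"[^\W\d_]+")

ascii_hits = rx.findall("abc_def,k9")
mixed_hits = rx.findall("rich\u00e9 \u3042\u65e5 42")

print(ascii_hits)
print(mixed_hits)
```

The underscore, comma, and digits act as separators, so only runs of letters come back.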

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd
>>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']

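U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but Python appears to interpret "digit" as "decimal digit" (category Nd), so it does not match \d.

U+24E8 (CIRCLED LATIN SMALL LETTER Y) does not match \w.

All CJK ideographs are classed as "letters" and thus match \w.

Whether or not any of these 3 points is a concern, this approach is the best you can get from the re module as currently released. Syntax like \p{letter} is something for the future.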
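If the first quirk matters, one option is to post-filter the matches on their Unicode general category. A sketch in Python 3, assuming that "letter" should mean a category starting with L (Lu, Ll, Lt, Lm, Lo); the helper name strict_letters is hypothetical:

```python
import re
import unicodedata

rx = re.compile(r"[^\W\d_]")

def strict_letters(text):
    """Return the characters matched by rx whose category starts with 'L'."""
    return [ch for ch in rx.findall(text)
            if unicodedata.category(ch).startswith("L")]

# Same test string as in the session above.
allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"

regex_only = rx.findall(allsorts)
filtered = strict_letters(allsorts)

print([hex(ord(ch)) for ch in regex_only])
print([hex(ord(ch)) for ch in filtered])
```

The filter drops U+3021 (category Nl) while keeping the Cyrillic, Arabic, and CJK letters.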

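"Unicode compatible": does that mean, for example, that you want to match both e and é? In Python, remember that to indicate a unicode string you must write it as u'unicode string here'. I assume you have already tried str.find(), where str is your unicode string?

I mean that I want to match é, あ, 日, 나, for example, but not 1, . (dot), ９, 9, and so on.

Thanks, but if I use a range like \u00E9-\u00F8, I can't tell whether a character is (CJK) punctuation or a numeric symbol other than 0-9.

If you refer to a document like this one, you can use letter ranges and select all the letter intervals (which can be tedious...); this link may also help you:

An example here would be helpful.

Thanks! Despite the quirks you mention, I think I can start from here and see what I can tune.

You can use one of the following expressions to match a single letter:

(?![\d_])\w

or

\w(?<![\d_])

Both match \w while asserting that [\d_] does not match the same character. The re documentation describes the two assertions: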

(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
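A short sketch of both lookaround forms in Python 3, where \w and \d are Unicode-aware for str patterns. The sample strings and variable names are made up for illustration:

```python
import re

lookahead = re.compile(r"(?![\d_])\w")    # check before consuming the character
lookbehind = re.compile(r"\w(?<![\d_])")  # check after consuming the character

# a, underscore, ASCII 1, é, fullwidth 9, hiragana あ
sample = "a_1\u00e9\uff19\u3042"
ahead_hits = lookahead.findall(sample)
behind_hits = lookbehind.findall(sample)
print(ahead_hits, behind_hits)

# The documentation's own example of a negative lookahead.
isaac = re.compile(r"Isaac (?!Asimov)")
print(bool(isaac.match("Isaac Newton")), bool(isaac.match("Isaac Asimov")))
```

Both patterns keep a, é, and あ and reject the underscore and both kinds of digit, so the choice between them is mostly a matter of taste.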