How do I write a regular expression in Python that matches all Unicode characters?

python regex python-2.7

I have read about and written hundreds of regular expressions, but I do not know how to detect a sequence of Unicode letters.

import re

# this will detect a sequence of English letters
re.compile(r'[a-zA-Z]+')
# this will detect a sequence of Unicode letters + [0-9_]
re.compile(r'\w+', re.UNICODE)
# how to detect a sequence of Unicode letters only (without [0-9_])?
re.compile(r'????', re.UNICODE)

How can I match only Unicode letters, without [0-9_]?

I tested your solutions:

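# -*- coding: utf-8 -*-
# Python 2.7 (uses ur'' literals and the print statement)
import re
import timeit

def test1():
    # negative lookahead: a word character that is not a digit or underscore
    regex = re.compile(ur'(?:(?![\d_])\w)+', re.UNICODE)
    return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

def test2():
    # negated class: anything that is not a non-word character, digit, or underscore
    regex = re.compile(ur'[^\W\d_]+', re.UNICODE)
    return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

print test1()
print test2()
print timeit.timeit(test1)
print timeit.timeit(test2)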
The output and timings (in seconds) are:

[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
11.0143377108
7.42619199741

You can combine a negative lookahead with \w to match "word characters" excluding digits and the underscore:

re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
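As a quick check of the lookahead pattern above, here is a minimal sketch that applies it to the sample sentence used elsewhere in this thread. The names `LETTERS` and `sample` are illustrative, not from the original answer, and the Polish letters are written as escapes so that the snippet does not depend on the source encoding.

```python
import re

# (?![\d_]) rejects a digit or underscore at the current position,
# then \w consumes one word character. re.UNICODE makes \w Unicode-aware
# on Python 2; for str patterns on Python 3 it is already the default.
LETTERS = re.compile(r'(?:(?![\d_])\w)+', re.UNICODE)

sample = u'Ala ma kota z czarn\u0105 sier\u015bci\u0105 - 1halo - halo1.'
words = LETTERS.findall(sample)
print(words)
```

The digits in "1halo" and "halo1" are skipped, so only "halo" is returned for both, which agrees with the output posted in the question.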
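Try this; it matches any Unicode character that is not a digit (which includes whitespace and punctuation, as the comments point out):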

re.compile(r'\D')
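To see why \D alone is usually too broad for this question, the following sketch compares it with the negated class [^\W\d_] on a short string. The sample string and variable names are made up for illustration.

```python
import re

sample = u'halo1 - 1halo'

# \D matches every non-digit, so spaces and the hyphen are included.
non_digits = re.findall(r'\D+', sample, re.UNICODE)

# [^\W\d_] keeps only word characters that are neither digits nor underscore.
letters = re.findall(r'[^\W\d_]+', sample, re.UNICODE)

print(non_digits)
print(letters)
```

The first list contains the separator " - " as a match of its own, while the second contains only the letter runs.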

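Use Unicode strings and declare the source encoding, then search for the characters you specified in your comment. Python 2.7 has no shortcut for "Unicode alphabetic characters":
Output: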

The quick brown fóx jumped over Łhe laży dog times

Also see the linked answer if you need every upper- and lowercase letter that Unicode defines.
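Because the standard `re` module has no property class for "any Unicode letter", one alternative that avoids listing characters by hand is to classify each character with `unicodedata.category`. This is only a sketch of that idea, not the code from the answer above; `is_letter` and `letter_runs` are hypothetical helper names, and the sample text is made up.

```python
import unicodedata
from itertools import groupby

def is_letter(ch):
    # Unicode general categories for letters all start with 'L'
    # (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith('L')

def letter_runs(text):
    # Group consecutive characters by is_letter and keep only the letter runs.
    return [u''.join(group) for keep, group in groupby(text, key=is_letter) if keep]

runs = letter_runs(u'f\u00f3x_1 \u0141he la\u017cy dog2')
print(runs)
```

This is slower than a compiled regular expression on long inputs, but it makes the definition of "letter" explicit and behaves the same on Python 2.7 and Python 3.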
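What is your definition of "Unicode character"? "Unicode" includes every character in the Unicode specification.
Maybe re.compile(r'[^0-9_]', re.UNICODE)
You will have to find all the ranges of the characters you want yourself.
Do you mean that you want to match all word characters (those used to form words in any language), not just the standard Latin letters A-Z and the standard digits 0-9? What about punctuation? Whitespace? Control characters? Symbol characters (such as math symbols)? The clearer you are about your requirements, the more likely you are to get a good answer.
@Aaron [^0-9_] matches not only letters but also whitespace, so it fails.
It fails: >>> re.findall(r'(?:(?![\d_])\w)+', 'Ala ma kota z czarną sierścią.', re.UNICODE) == ['Ala', 'ma', 'kota', 'z', 'czarn\xb9', 'sier', 'ci\xb9']
I suspect this is an encoding problem with your string. It works for me with Python 3. If you are on Python 2, try prefixing the string literal with u to make it a Unicode literal.
This also matches whitespace and symbols, and it needs the re.UNICODE flag.
Fails. It also matches whitespace.
Your solution is not a good pattern because it only works for Polish. Better would be [^\W\d_], I think, though it needs testing, or (?:(?![\d_])\w)+
@Chameleon, also see the linked answer for a complete solution.
@Chameleon, [^\W\d_] works if you add the Unicode flag. See the updated version, but make sure to use Unicode strings.
I always use Unicode, because I work on global programs that use Polish, German, and English.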
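The failing result quoted in the comments ('czarn\xb9', 'sier', 'ci\xb9') looks like what happens when an encoded byte string is matched instead of a Unicode string. The sketch below assumes the bytes were cp1250-encoded, which is an inference from the \xb9 byte and not something the thread states, and decodes them before matching.

```python
import re

LETTERS = re.compile(r'[^\W\d_]+', re.UNICODE)

# Assumption: these bytes are "czarną sierścią" encoded as cp1250,
# where 0xB9 is ą and 0x9C is ś.
raw = b'czarn\xb9 sier\x9cci\xb9'

# Decode to a Unicode string first, then match on the decoded text.
text = raw.decode('cp1250')
decoded_words = LETTERS.findall(text)
print(decoded_words)
```

Once the text is decoded, the pattern returns the two Polish words intact, with no split at ś.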