Python: matching names of the form Firstname Lastname with international characters

python regex

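I am trying to capture names by assuming they appear in the form Firstname Lastname. This works well with the code below, but I would also like to capture international names such as Pär Åberg. I found some solutions, but unfortunately they do not seem to work with Python-flavoured regexps. Does anyone have any insight into this?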

#!/usr/bin/python
# -*- coding: utf-8 -*- 
import re

text = """
This is a text containing names of people in the text such as 
Hillary Clinton or Barack Obama. My problem is with names that uses stuff 
outside A-Z like Swedish names such as Pär Åberg."""

for name in re.findall("(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    firstname = name[0].split()[0]
    print firstname

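You need an alternative regex library (the third-party regex module), because there you can use \p{L}, which matches any Unicode letter, and \p{Lu}, which matches any uppercase Unicode letter. The standard re module does not support \p{...} classes. The ur prefix below is Python 2 syntax; on Python 3, write the pattern as a plain r'...' string.

Then, use

ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'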
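As a rough illustration of the pattern above, here is a minimal sketch for Python 3. It assumes the third-party regex module is installed (pip install regex); the sample text comes from the question, while the names NAME_PATTERN and find_names are made up for this example.

```python
import regex  # third-party module, not the standard-library re

# \p{Lu} matches any uppercase Unicode letter, so "Å" is accepted as the
# first character of a name. The (?:...) group is non-capturing, which
# makes findall return whole matches instead of tuples of groups.
NAME_PATTERN = regex.compile(r"\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+")


def find_names(text):
    """Return every run of two or more capitalised words found in text."""
    return NAME_PATTERN.findall(text)


text = """
This is a text containing names of people in the text such as
Hillary Clinton or Barack Obama. My problem is with names that uses stuff
outside A-Z like Swedish names such as Pär Åberg."""

for name in find_names(text):
    print(name.split()[0])  # first name only
```

With this sample text the loop prints Hillary, Barack and Pär. Note that the pattern still relies on capitalisation alone, so any run of capitalised words would be reported as a name.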

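When the regex is initialized with a Unicode string, the UNICODE flag is applied automatically:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it is a bytestring.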
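The effect of that default can be checked with a small sketch. This example deliberately uses Python 3's standard-library re module rather than the regex module, on the understanding that re follows the same rule for str and bytes patterns; the variable names are illustrative only.

```python
import re

# str pattern with no flag: \w uses Unicode matching by default.
word = re.compile(r"\w+")
# str pattern with an explicit ASCII flag: \w is limited to [a-zA-Z0-9_].
ascii_word = re.compile(r"\w+", re.ASCII)
# bytes pattern with no flag: matching is ASCII-only.
byte_word = re.compile(rb"\w+")

name = "Pär Åberg"
unicode_tokens = word.findall(name)
ascii_tokens = ascii_word.findall(name)
byte_tokens = byte_word.findall(name.encode("utf-8"))

print(unicode_tokens)  # ['Pär', 'Åberg']
print(ascii_tokens)    # ['P', 'r', 'berg']
print(byte_tokens)     # [b'P', b'r', b'berg']
```

In ASCII mode the accented letters are not word characters, so each name is split around them.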

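Mind the capturing groups when using findall. For the last name, you could search for any characters between spaces. Try re.findall with the pattern

r'[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+'

The correct answer is to use the regex module with

r'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'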

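Expanding on what @stribizev said, you need to include the locale and Unicode flags. Works like a charm! Apart from updating the regular expression, I only needed to change 'firstname = name[0].split()[0]' to 'firstname = name.split()[0]'.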
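The change from name[0] to name follows from how findall treats capturing groups. The sketch below shows the difference using the standard-library re module and plain ASCII names; the variable names are chosen for this example only.

```python
import re

text = "Hillary Clinton or Barack Obama."

# Pattern from the question: three capturing groups, so findall returns
# one 3-tuple per match (whole name, first initial, last repeated word).
grouped = re.findall(r"(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text)

# Same pattern with a non-capturing group: findall returns plain strings.
plain = re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+", text)

# With tuples, the full name is element 0; with strings, it is the item itself.
first_from_grouped = [name[0].split()[0] for name in grouped]
first_from_plain = [name.split()[0] for name in plain]

print(grouped)  # [('Hillary Clinton', 'H', ' Clinton'), ('Barack Obama', 'B', ' Obama')]
print(plain)    # ['Hillary Clinton', 'Barack Obama']
```

Keeping name[0] after switching to the non-capturing pattern would index into the string itself and yield only the first letter of each name.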