Python: extracting the contents of strings inside parentheses

I have the following string:

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

I would like to create a list of tuples of the form [(actor_name, character_name), ...], like so:
[(Will Ferrell, Nick Halsey), (Rebecca Hall, Samantha), (Michael Pena, Frank Garcia)]

I am currently using a hack-ish method to achieve this: splitting on the ( character after using .rstrip(')'), like so:
for item in string.split(','):
    item.rstrip(')').split('(')

Is there a better, more robust way to do this? Thanks.
This is a good place for a regular expression:
>>> import re
>>> pat = r"([^,\(]*)\((.*?)\)"
>>> re.findall(pat, "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)")
[('Will Ferrell ', 'Nick Halsey'), (' Rebecca Hall ', 'Samantha'), (' Michael Pena ', 'Frank Garcia')]
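
The tuples above still carry the spaces around each name. A minimal sketch of one way to trim them, reusing the same pattern written as a raw string; the names `cast` and `pairs` are only illustrative and not part of the original answer:

```python
import re

cast = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# Same pattern as above, written as a raw string so the backslashes
# reach the regex engine unchanged.
pat = r"([^,\(]*)\((.*?)\)"

# Strip the surrounding whitespace from both captured groups.
pairs = [(actor.strip(), role.strip()) for actor, role in re.findall(pat, cast)]
print(pairs)
```

Stripping after the match keeps the pattern itself short, at the cost of one extra pass over the results.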

This is a more explicit answer than the others, and I think it fits your needs:
import re
regex = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')
actor_character = regex.findall(string)

I admit it is a bit ugly, but as I said, it is more explicit.
string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

lst = [(t[0].strip(), t[1].strip()) for t in pat.findall(string)]

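The compiled pattern is a bit tricky. It is a raw string, to make the backslashes less insane. What it means is: start a match group; match anything that is not a '(' character, any number of times as long as it is at least once; close the match group; match any white space (including none); match a literal '(' character; start another match group; match anything that is not a ')' character, any number of times as long as it is at least once; close the match group; match a literal ')' character; then match any white space (including none); then something really tricky. The really tricky part is a grouping that does not form a match group. Instead of starting with '(' and ending with ')', it starts with "(?:" and then again ends with ')'. I used this grouping so I could put a vertical bar in it to allow two alternative patterns: either a comma followed by any amount of white space, or the end of the line (the '$' character).
Then I used pat.findall() to find all the places within string that the pattern matches; it automatically returns tuples. I put that in a list comprehension and called .strip() on each item to clean off the white space.
Of course, we can make the regular expression even more complicated and have it return names that already have the white space stripped off. The regular expression gets really hairy, though, so we will use one of the coolest features of Python regular expressions: "verbose" mode, where you can sprawl a pattern across many lines and put in comments as you like. We use a raw triple-quoted string so that both the backslashes and the multiple lines are convenient. Here you go: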
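
To illustrate why the trailing group is written as a non-capturing "(?:...)", here is a small sketch comparing it with an ordinary capturing group. The shortened input and the names `text`, `non_capturing`, `capturing`, `two` and `three` are assumptions made for this example only:

```python
import re

text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha)"

# Trailing alternation in a non-capturing group: findall() yields 2-tuples.
non_capturing = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

# Same pattern with a capturing trailing group: findall() yields 3-tuples,
# with the matched separator as an unwanted third element.
capturing = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(,\s*|$)')

two = non_capturing.findall(text)
three = capturing.findall(text)
print(two)
print(three)
```

With the capturing version every tuple would need its last element dropped before use, which is the extra work the non-capturing group avoids.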
import re
s_pat = r'''
\s*  # any amount of white space
([^( \t]  # start match group; match one char that is not a '(' or space or tab
[^(]*  # match any number of non '(' characters
[^( \t])  # match one char that is not a '(' or space or tab; close match group
\s*  # any amount of white space
\(  # match an actual required '(' char (not in any match group)
\s*  # any amount of white space
([^) \t]  # start match group; match one char that is not a ')' or space or tab
[^)]*  # match any number of non ')' characters
[^) \t])  # match one char that is not a ')' or space or tab; close match group
\s*  # any amount of white space
\) # match an actual required ')' char (not in any match group)
\s*  # any amount of white space
(?:,|$)  # non-match group: either a comma or the end of a line
'''
pat = re.compile(s_pat, re.VERBOSE)

lst = pat.findall(string)

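Man, that really is not worth the effort.
Also, the above preserves the white space inside the names. You can easily normalize the white space, to ensure it is 100% consistent, by splitting on white space and re-joining with single spaces.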
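
For comparison, a shorter way to get already-stripped names is to use lazy quantifiers and let `\s*` absorb the padding. This is a different technique from the verbose pattern, sketched here under the assumption that names contain no parentheses or commas; `cast`, `lazy_pat` and `stripped` are hypothetical names:

```python
import re

cast = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# The lazy +? stops each group as early as possible, so the \s* on either
# side consumes leading and trailing white space instead of the groups.
lazy_pat = re.compile(r'\s*([^(),]+?)\s*\(\s*([^()]+?)\s*\)')

stripped = lazy_pat.findall(cast)
print(stripped)
```

White space inside a name is left untouched by this pattern; only the padding at the edges is dropped.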
string = '  Will   Ferrell  ( Nick\tHalsey ) , Rebecca Hall (Samantha), Michael\fPena (Frank Garcia)'

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

def nws(s):
    """normalize white space.  Replaces all runs of white space by a single space."""
    return " ".join(s.split())

lst = [tuple(nws(item) for item in t) for t in pat.findall(string)]

print(lst)  # prints: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]

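Now the string has silly white space in it: multiple spaces, a tab, and even a form feed ("\f"). The code above cleans it up so that the words in each name are separated by a single space.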
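
The same normalization can also be written with re.sub instead of split and join. A small sketch, where `nws_re` is a hypothetical name chosen to avoid clashing with the function above:

```python
import re

def nws_re(s):
    """Collapse every run of white space to one space and trim the ends."""
    return re.sub(r'\s+', ' ', s).strip()

# Each sample contains one kind of messy white space from the string above.
samples = ["  Will   Ferrell  ", " Nick\tHalsey ", "Michael\fPena "]
cleaned = [nws_re(s) for s in samples]
print(cleaned)
```

Both versions treat tabs and form feeds as white space, so either one should give the same result on this input.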
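If there are no nested parentheses, you can do it with a regular expression like this.
I think it is easier to read, and I like how restrictive the regex is.
This regular expression will not match any actors with Unicode characters in their names, such as accented letters. It will also fail for actors with more than two names or with punctuation, for example "Samuel L. Jackson" or "Robert Downey Jr."
If you are parsing something like a computer language and you know the input will work with this, I like that it is simple and clear. My own answer is uglier but more robust.
I agree. I sort of forgot about Unicode... or any complexity for that matter; I was looking for the quickest solution to the sample.
Also, this regex fails on the Rebecca Hall entry, because her character name ("Samantha") has no space in it.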
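
The limitation discussed in these comments can be checked directly. The sketch below runs the explicit two-word pattern on the sample string, then tries a relaxed variant that accepts one or more words made of Unicode letters. The relaxed pattern and the names `cast`, `strict`, `relaxed`, `strict_pairs` and `relaxed_pairs` are assumptions for illustration, not part of any answer, and the relaxed pattern still rejects punctuation such as the period in "Samuel L. Jackson":

```python
import re

cast = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# The explicit pattern: exactly two ASCII words on each side.
strict = re.compile(r'([a-zA-Z]+ [a-zA-Z]+) \(([a-zA-Z]+ [a-zA-Z]+)\)')

# Hypothetical relaxed variant: one or more words of Unicode letters.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
word = r'[^\W\d_]+'
relaxed = re.compile(r'({0}(?: {0})*) \(({0}(?: {0})*)\)'.format(word))

strict_pairs = strict.findall(cast)    # the Rebecca Hall entry is missing
relaxed_pairs = relaxed.findall(cast)  # all three entries are found
print(strict_pairs)
print(relaxed_pairs)
```

Whether the relaxed form is worth it depends on the input; for arbitrary names, the character-class patterns in the other answers make fewer assumptions.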