Python: extracting information from a string with regular expressions


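This is a follow-up to, and a complication of, an earlier question.

In that question, I had the following string --

and I wanted a list of tuples in the format (actor, role) --

To generalize the problem, I now have a slightly more complicated string from which I need to extract the same information. My string is --

I need to format it as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
 ('Stephen Root', ''), ('Laura Dern', 'Delilah')]

I know I could replace the filler words (with, and, &, etc.), but I can't quite figure out how to add an empty entry, '', when an actor has no role name (Stephen Root in this case). What is the best way to do this?

Finally, I need to account for an actor having multiple roles, and build one tuple for each of the actor's roles. The final string is: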

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"

I need to build a list of tuples like this:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
 ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Thanks, everyone.

import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

characters = splitre.split(credits)
pairs = []
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
 ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
 ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
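
The key piece of the split pattern above is the negative lookahead, which rejects any comma that is followed by a closing parenthesis before another parenthesis appears. The snippet below is a minimal sketch that isolates just that piece; the names outside_comma and sample are made up for illustration, and it assumes parentheses are never nested.

```python
import re

# A comma counts as "outside parentheses" when no ")" follows it
# before the next "(" or ")" in the string.
outside_comma = re.compile(r",(?![^()]*\))")

sample = "Glenn Howerton (Gary, Brad), Rebecca Hall (Samantha)"

# The comma between Gary and Brad is kept; the one after ")" is a split point.
pieces = [piece.strip() for piece in outside_comma.split(sample)]
print(pieces)
```

The comma inside "(Gary, Brad)" survives the split, which is what lets the later step turn it into two separate roles for the same actor.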

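What you need is to identify sequences of words that start with a capital letter, plus some complications (you cannot assume every name has the form Firstname Lastname; there is also Firstname Lastname Jr., Firstname M. Lastname, and other localized variants such as Jean-Claude van Damme, Louis da Silva, and so on).

Now, this may be overkill for the sample input you posted, but as I wrote above, I think things will get messy quickly, so I would use nltk.

Here is a very rough and not very well-tested code snippet, but it should do the job: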

import nltk
from nltk.chunk.regexp import RegexpParser

_patterns = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'),  # proper nouns
    (r'^[(]$', 'O'),
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),
    (r'.+', 'NN')                                   # nouns (default)
]

_grammar = """
        NAME: {<NNP> <COMMA> <NNP>}
        NAME: {<NNP>+}
        ROLE: {<O> <NAME>+ <C>}
        """    
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
tagger = nltk.RegexpTagger(_patterns)    
chunker = RegexpParser(_grammar)
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tokens = text.split()
tagged_text = tagger.tag(tokens)
tree = chunker.parse(tagged_text)

for n in tree:
    # NLTK 3 replaced Tree.node with Tree.label(); print is a function in Python 3
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']:
        print(n)

# output is:
# (NAME Will/NNP Ferrell/NNP)
# (ROLE (/O (NAME Nick/NNP Halsey/NNP) )/C)
# (NAME Rebecca/NNP Hall/NNP)
# (ROLE (/O (NAME Samantha/NNP) )/C)
# (NAME Glenn/NNP Howerton/NNP)
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP) )/C)
# (NAME Stephen/NNP Root/NNP)
# (NAME Laura/NNP Dern/NNP)
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP) )/C)
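
You then have to process the tagged output and put the names and roles into a list instead of printing them, but you get the picture.

What we do here is a first pass that tags each token according to the regexes in _patterns, followed by a second pass that builds more complex chunks according to a simple grammar. You can make the grammar and the patterns as elaborate as you need, for example to catch name variants, messy input, abbreviations, and so on.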
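
To make the two-pass idea concrete without the nltk dependency, here is a rough standard-library sketch: pass 1 tags each token with the first matching regex, and pass 2 groups the tagged tokens into (actor, role) tuples. This is not the nltk chunker, only an illustration of the same tag-then-group approach; PATTERNS, tag and extract_pairs are hypothetical names, and the name pattern is deliberately simplistic.

```python
import re

# Ordered (pattern, tag) pairs; the first pattern that matches wins.
PATTERNS = [
    (re.compile(r"^[A-Z][A-Za-z.'-]*$"), "NNP"),  # capitalised word: part of a name
    (re.compile(r"^\($"), "O"),                   # opening parenthesis
    (re.compile(r"^\)$"), "C"),                   # closing parenthesis
    (re.compile(r"^,$"), "COMMA"),
    (re.compile(r".+"), "NN"),                    # anything else, e.g. "with", "and"
]


def tag(token):
    """Return the tag of the first pattern that matches the token."""
    for pattern, label in PATTERNS:
        if pattern.match(token):
            return label
    return "NN"


def extract_pairs(text):
    # Pass 1: isolate punctuation as separate tokens, then tag every token.
    spaced = text.replace("(", " ( ").replace(")", " ) ").replace(",", " , ")
    tagged = [(tok, tag(tok)) for tok in spaced.split()]

    # Pass 2: group the tagged tokens into actors and their roles.
    pairs, actor, role, roles = [], [], [], []
    in_parens = False

    def close_role():
        # Finish the role currently being collected, if any.
        if role:
            roles.append(" ".join(role))
            role.clear()

    def close_actor():
        # Emit one tuple per role, or a single tuple with an empty role.
        if actor:
            name = " ".join(actor)
            pairs.extend([(name, r) for r in (roles or [""])])
        actor.clear()
        roles.clear()

    for tok, label in tagged:
        if label == "O":
            in_parens = True
        elif label == "C":
            close_role()
            in_parens = False
            close_actor()
        elif in_parens:
            if label == "COMMA":
                close_role()
            else:
                role.append(tok)
        elif label == "NNP":
            actor.append(tok)
        else:
            # A comma or filler word outside parentheses ends the current actor.
            close_actor()
    close_actor()
    return pairs


text = ("Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), "
        "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)")
print(extract_pairs(text))
```

Lower-case name particles such as "van" or "da" would be tagged as filler here and split the name, which is exactly the kind of case where a richer pattern list or grammar pays off.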

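I think doing this in a single regex pass would be a pain for non-trivial input.

Otherwise, the pure-regex solution above solves the problem nicely for the input you posted, and it has no nltk dependency.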

If you need a non-regex solution, the character-by-character parser listed further down (the listing that begins with in_string = ...) handles this input. It assumes there are no nested parentheses.

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Tim Pietzcker's solution can be simplified (note that the patterns were also modified); see the listing further down whose credits string begins with leading spaces, and the generator-based variant after it.

@Michael: Thanks for the spelling edit.

Is using a regex really necessary?

No, it can be anything. Whatever works, and ideally whatever is best.

You could swap … to … (), …; then remove … with …, the same as in the first example, but I think you should build some kind of parser yourself :) For your second part (an actor with two roles), I would do the same as above (using … ('Glenn Howerton','Gary …

The non-regex solution, a character-by-character parser:

in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"

in_list = []
is_in_paren = False
item = {}
next_string = ''

index = 0
while index < len(in_string):
    char = in_string[index]  

    if in_string[index:].startswith(' and') and not is_in_paren:
        actor = next_string
        if actor.startswith(' with '):
            actor = actor[6:]
        item['actor'] = actor
        in_list.append(item)
        item = {}
        next_string = ''
        index += 4    
    elif char == '(':
        is_in_paren = True
        item['actor'] = next_string
        next_string = ''    
    elif char == ')':
        is_in_paren = False
        item['part'] = next_string
        in_list.append(item)
        item = {}                 
        next_string = ''
    elif char == ',':
        if is_in_paren:
            item['part'] = next_string
            next_string = ''
            in_list.append(item)
            item = item.copy()
            item.pop('part')                
    else:
        next_string = "%s%s" % (next_string, char)

    index += 1


out_list = []
for entry in in_list:  # renamed from "dict" to avoid shadowing the built-in
    actor = entry.get('actor')
    part = entry.get('part')

    if part is None:
        part = ''

    out_list.append((actor.strip(), part.strip()))

print(out_list)

The simplified version of the regex solution:

import re
credits = """   Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
 Stephen Root and Laura Dern (Delilah, Stacy)"""

# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")

# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")

# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")

pairs = []
for character in splitre.split(credits):
    gr = matchre.match(character).groups('')
    for part in splitparts.split(gr[1]):
        pairs.append((gr[0], part))

print(pairs)

The same logic written with a generator expression and a list comprehension (it reuses credits, splitre, matchre and splitparts from the listing above):

gen = (matchre.match(character).groups('') for character in splitre.split(credits))

pp = [(gr[0], part) for gr in gen for part in splitparts.split(gr[1])]

print(pp)
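
The question also mentions "&" as a possible filler word, which none of the listings handle. As a hedged sketch, the split pattern from the first regex solution can be given one more alternative and the whole thing wrapped in a function; parse_credits and the sample string below are invented for illustration, and treating "&" as a separator between actors is an assumption about the input.

```python
import re

# Same patterns as the first regex solution, with "&" added as a separator.
SPLIT_RE = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b|&)\s*")
MATCH_RE = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
PARTS_RE = re.compile(r"\s*,\s*")


def parse_credits(text):
    """Return a list of (actor, role) tuples; role is '' when none is given."""
    pairs = []
    for chunk in SPLIT_RE.split(text):
        if not chunk.strip():
            continue  # adjacent separators such as ", with" leave empty chunks
        actor, roles = MATCH_RE.match(chunk).groups("")
        for role in PARTS_RE.split(roles.strip()):
            pairs.append((actor.strip(), role))
    return pairs


result = parse_credits(
    "Will Ferrell (Nick Halsey) & Stephen Root, with Laura Dern (Delilah, Stacy)"
)
print(result)
```

Wrapping the logic in a function also makes it easier to try the patterns against other credit strings before relying on them.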