Python: extract required names from a given format
I have a text file containing the data shown below, and I need to extract certain names from it. I tried the code below, but it does not give the desired result. The file contains the following data:
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432
The code I am trying:
import re
pattern = re.compile(r'(Leader|Head\\Organiser|Captain|Vice Captain).*(\w+)', re.I)
# line is one line read from the text file
matches = pattern.findall(line)
for match in matches:
    print(match)
Expected output:
Tim Lee
Sam Mathews
Jocey David
Jacob Green
Explanation:
(?: : start non capture group
Leader : literally
| : OR
Head : literally
(?: : start non capture group
\\Organiser : literally
| : OR
\\Secretary : literally
)? : end group, optional
| : OR
Captain : literally
| : OR
Vice Captain : literally
) : end group
\W+ : 1 or more non word character
( : start group 1
\w+ : 1 or more word char
(?: : non capture group
\s+ : 1 or more spaces
\w+ : 1 or more word char
)? : end group, optional
) : end group 1
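The pattern that this breakdown describes is not shown in the text, so here is a minimal sketch that assembles it from the pieces listed above. The exact spelling of the pattern and the variable names `s`, `pattern`, and `names` are assumptions; the sample data is taken from the question.

```python
import re

# Sample data from the question; a raw string keeps the backslashes literal.
s = r'''Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''

# Pattern reconstructed from the breakdown above (an assumption, not quoted code).
pattern = re.compile(
    r'(?:Leader|Head(?:\\Organiser|\\Secretary)?|Captain|Vice Captain)'  # title
    r'\W+'                                                               # separators
    r'(\w+(?:\s+\w+)?)',                                                 # group 1: name
    re.I,
)

# With a single capture group, findall returns just the captured names.
names = pattern.findall(s)
for name in names:
    print(name)
```

Because the title group is non-capturing, `findall` returns plain strings rather than tuples.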
Result for the given example:
Tim Lee
Sam Mathews
Alica Mills
Maya Hill
Jocey David
Jacob Green
Given:
import re
s = r'''Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''
You can get the names like this:
>>> [e.rstrip() for e in re.findall(r'[:-]+[ \t]+(.*?)[;#]',s)]
['Tim Lee', 'Sam Mathews', 'Alica Mills', 'Maya Hill', 'Jocey David', 'Jacob Green']
Or build a dict that maps each title to its name:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Head|Head\\Secretary|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Head': 'Alica Mills', 'Head\\Secretary': 'Maya Hill', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
That can then be restricted to the titles you want:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
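If the list of wanted titles changes often, the alternation can be built from a list instead of being typed by hand. This is only a sketch: the helper name `names_by_title` and its parameters are hypothetical, and it reuses the same separator rules as the pattern above.

```python
import re

def names_by_title(text, titles):
    """Map each wanted title to the name that follows it (hypothetical helper)."""
    # re.escape makes the backslash and the space in a title match literally.
    # Longer titles go first so a longer title is tried before a shorter prefix.
    alternation = '|'.join(
        re.escape(t) for t in sorted(titles, key=len, reverse=True))
    pattern = re.compile(
        r'^\s*(' + alternation + r')\s*[:-]+[ \t]+(.*?)[;#]', re.M)
    return {title: name.rstrip() for title, name in pattern.findall(text)}

s = r'''Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''

wanted = ['Leader', r'Head\Organiser', 'Captain', 'Vice Captain']
result = names_by_title(s, wanted)
print(result)
```

The dict follows the order in which the titles appear in the text, not the order of the `wanted` list.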
If you only need the names (Python 3.6+ preserves dict insertion order, so they come out in the order they appear in the string):
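The snippet for this step did not survive in the text. A minimal sketch of one way to do it, assuming the same sample string and the restricted-title pattern from the previous step, is to take the values of the dict; the names `pairs`, `by_title`, and `names` are made up for illustration.

```python
import re

s = r'''Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''

# Same restricted-title pattern as in the previous step.
pairs = re.findall(
    r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',
    s, re.M)

# Build the title -> name dict, then keep only the names.
by_title = {k: v.rstrip() for k, v in pairs}
names = list(by_title.values())
print(names)
```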
Thank you very much. Actually, the second line might be just Head: Sam Mathews; 11:53 am, without "Organiser", so this code might not work because we are using \W+. Is there a way to handle that? Sorry for adding requirements after the fact. Please help me with this.
This is the perfect solution to my problem: pattern = re.compile(r'(?:Leader|Head(?:\\\w+)?|Captain|Vice Captain)\W+(\w+(?:\s+\w+)?)', re.I)
Thanks. Yes, that was helpful.
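To check the follow-up case concretely, here is a small sketch that makes the part after Head optional, so a line matches with or without a suffix such as \Organiser. The pattern is an assumed reading of the one discussed in the comments, and the two test lines are variations on the question's second line.

```python
import re

# Assumed pattern: "Head" may be followed by a backslash plus a word, or by nothing.
pattern = re.compile(
    r'(?:Leader|Head(?:\\\w+)?|Captain|Vice Captain)\W+(\w+(?:\s+\w+)?)', re.I)

lines = [
    r'Head\Organiser: Sam Mathews; 11:53 am',
    r'Head: Sam Mathews; 11:53 am',
]

# search() finds the first title in each line; group(1) is the captured name.
found = [pattern.search(line).group(1) for line in lines]
print(found)
```

Note that this form accepts any Head\... title, including Head\Secretary, so filtering by title has to happen separately if only some of them are wanted.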