Python: how to match a regular expression for strings that may or may not exist, with a placeholder when there is no match
Suppose I have a large text file in the following format:
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
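There are 8 keys in total, and they always appear in the order Surname, Name, Age, Weight, Height, School, Siblings, Quote, which I know in advance. As you can see, some profiles do not have the full set of variables. The only key guaranteed to be present is Name.
I want to create a DataFrame with one row per observation and one column per key. In James's case, since he has no entries for School and Siblings, I want those cells to be numpy nan objects.
My attempt was to use something like (?:\[Surname: \"()\"\]) for each variable. But even for the surname alone I ran into a problem: if the surname does not exist, no placeholder is returned, only an empty list.
Update:
As an example, I would like the result for Monica's profile to be
('', 'Monica', '', '33', '', '', '', 'I am looking forward to christmas')
You can parse the file data, group the results, and pass them to a DataFrame: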
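import re
import pandas as pd

def group_results(d):
    # d is a list of [key, value] pairs in file order; yield one list of pairs per person
    _group = [d[0]]
    for a, b in d[1:]:
        if a == 'Name' and not any(c == 'Name' for c, _ in _group):
            # first Name of the current person (possibly preceded by a Surname)
            _group.append([a, b])
        elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
            # a Surname after a Name starts the next person
            yield _group
            _group = [[a, b]]
        else:
            if a == 'Name':
                # a second Name also starts the next person
                yield _group
                _group = [[a, b]]
            else:
                _group.append([a, b])
    yield _group

headers = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]
# read the non-empty lines of the input file
with open('filename.txt') as f:
    data = [line.strip() for line in f if line.strip()]
# split each line into [key, value], dropping the quotes around the value
parsed = [(lambda x: [x[0], x[-1][1:-1]])(re.findall(r'(?<=^\[)\w+|".*?"(?=\]$)', i)) for i in data]
_grouped = list(map(dict, group_results(parsed)))
# missing keys become empty strings
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped], columns=headers)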
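The question asked for NaN in the missing cells, while the code above fills them with empty strings. A minimal sketch of the NaN variant, assuming each profile has already been parsed into a dict: the `profiles` list and the `to_rows` helper are hypothetical names used only for illustration. `math.nan` is an ordinary float NaN, so passing `rows` and `headers` to `pd.DataFrame` should yield NaN cells.

```python
import math

headers = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]

# Hypothetical parsed profiles, one dict per person (missing keys are simply absent).
profiles = [
    {"Surname": "Gordon", "Name": "James", "Age": "13", "Weight": "46",
     "Height": "12", "Quote": "I want to be a pilot"},
    {"Name": "Monica", "Weight": "33", "Quote": "I am looking forward to christmas"},
]

def to_rows(profiles, headers, missing=math.nan):
    """Return one list per profile, with `missing` wherever a key is absent."""
    return [[p.get(h, missing) for h in headers] for p in profiles]

rows = to_rows(profiles, headers)
for row in rows:
    print(row)
```

Swapping `missing` for `""` reproduces the empty-string behaviour of the answer above.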
Based on @WiktorStribiżew's comment, you can use groupby (from itertools) to split the lines into runs of blank lines and runs of data lines, for example:
import re
from itertools import groupby

# records must be separated by blank lines for the grouping below to work
text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]

[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]

[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''

# one pattern per field; the named group becomes the key in the record dictionary
patterns = [re.compile(r'(\[Surname: "(?P<surname>\w+?)"\])'),
            re.compile(r'(\[Name: "(?P<name>\w+?)"\])'),
            re.compile(r'(\[Age: "(?P<age>\d+?)"\])'),
            re.compile(r'\[Weight: "(?P<weight>\d+?)"\]'),
            re.compile(r'\[Height: "(?P<height>\d+?)"\]'),
            re.compile(r'\[Quote: "(?P<quote>.+?)"\]')]

records = []
for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
    if non_empty:
        lines = list(group)
        record = {}
        for line in lines:
            for pattern in patterns:
                match = pattern.search(line)
                if match:
                    record.update(match.groupdict())
                    break
        records.append(record)

for record in records:
    print(record)
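Note: this creates one dictionary per record, where the keys are the field names and the values are the field values. This format does not match your expected output, but I believe it is more complete than what you asked for. In any case, you can easily convert it to the tuple format you want.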
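The conversion mentioned in the note can be sketched as follows. This assumes the lowercase keys produced by the named groups in the patterns (surname, name, age, weight, height, quote); `FIELDS` and `record_to_tuple` are hypothetical names introduced for illustration.

```python
# Field order taken from the named groups in the patterns (an assumption of this sketch).
FIELDS = ("surname", "name", "age", "weight", "height", "quote")

def record_to_tuple(record, placeholder=""):
    """Turn one record dict into a fixed-length tuple, filling gaps with `placeholder`."""
    return tuple(record.get(field, placeholder) for field in FIELDS)

monica = {"name": "Monica", "weight": "33", "quote": "I am looking forward to christmas"}
print(record_to_tuple(monica))
```

Because every tuple has the same length and field order, a list of them can be handed to a DataFrame constructor together with the field names as columns.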
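Explanation
The groupby function from itertools groups the input into consecutive runs of blank lines and runs of record lines, so you only need to process the non-empty groups. Each line is then handled simply: try each pattern in turn and, on the first match, update the record dictionary with the field value from the named group and break, on the assumption that each line matches at most one pattern.
You can rewrite the data file. The code below parses the original file into instances of class D, then uses csv.DictWriter to write them to a regular CSV file, which should be readable by pandas:
Create the demo file: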
fn = "t.txt"
with open(fn, "w") as f:
    f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Intermediate class:
class D:
    fields = ["Surname", "Name", "Age", "Weight", "Height", "Quote"]

    def __init__(self, textlines):
        # each line looks like: Key: "value" (the value keeps its double quotes)
        t = [(k.strip(), v.strip()) for k, v in (x.strip().split(":", 1) for x in textlines)]
        # start with every field empty, then fill in the ones that are present
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

    def surname(self): return self.data["Surname"]
    def name(self): return self.data["Name"]
    def age(self): return self.data["Age"]
    def weight(self): return self.data["Weight"]
    def height(self): return self.data["Height"]
    def quote(self): return self.data["Quote"]

    def get_data(self):
        return self.data
Parse and rewrite:
fn = "t.txt"
# list of all collected D instances
data = []
with open(fn) as f:
    # each dataset contains all lines belonging to one "person"
    dataset = []
    for line in f:
        clean = line.strip().strip("[]")
        if clean and (clean.startswith("Surname") or clean.startswith("Name")):
            # a Surname or Name line starts a new person once the current one has a Name
            if any(e.startswith("Name") for e in dataset):
                data.append(D(dataset))
                dataset = []
            dataset.append(clean)
        elif clean:
            dataset.append(clean)
    if dataset:
        data.append(D(dataset))
import csv
with open("other.txt", "w", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=D.fields)
    dw.writeheader()
    for entry in data:
        dw.writerow(entry.get_data())
Check what was written:
with open("other.txt", "r") as f:
    print(f.read())
Output (contents of other.txt):
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
For comparison, the DataFrame built by the first answer (8 columns, including School and Siblings) prints as:
  Surname    Name  ...  Siblings                              Quote
0  Gordon   James  ...                         I want to be a pilot
1          Monica  ...            I am looking forward to christmas
[2 rows x 8 columns]
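The rewritten other.txt keeps the literal double quotes from the source file, which is why every value appears triple-quoted in the CSV. A minimal sketch of reading it back with the standard csv module and stripping those quotes: it uses an in-memory copy of the CSV text instead of the file, and `read_profiles` is a hypothetical helper name.

```python
import csv
import io

# In-memory copy of the CSV text, in the same shape as the other.txt contents shown above.
csv_text = (
    'Surname,Name,Age,Weight,Height,Quote\n'
    '"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""\n'
    ',"""Monica""",,"""33""",,"""I am looking forward to christmas"""\n'
)

def read_profiles(handle):
    """Read the CSV and strip the literal double quotes kept from the source file."""
    return [{k: v.strip('"') for k, v in row.items()} for row in csv.DictReader(handle)]

profiles = read_profiles(io.StringIO(csv_text))
for p in profiles:
    print(p)
```

Missing fields come back as empty strings, so they can still be mapped to any placeholder afterwards.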
Use re.findall() to build a list of (key, value) tuples, and collect them into a separate dictionary for each block of information:
import re

text = """[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""

keys = ['Surname', 'Name', 'Age', 'Weight', 'Height', 'Quote']
rslt = [{}]
for k, v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+", text):
    d = rslt[-1]
    # start a new dictionary on a Surname (if the current one is non-empty) or on a second Name
    if (k == "Surname" and d) or (k == "Name" and "Name" in d):
        d = {}
        rslt.append(d)
    d[k] = v

for d in rslt:
    print([d.get(k, '') for k in keys])
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']
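The question's original idea, one regular expression that returns a placeholder for absent fields, can also be sketched directly. This is an illustrative variant, not taken from any of the answers: it relies on the stated guarantees that the keys always appear in a fixed order and that Name is always present. The names `field`, `RECORD` and `parse_profiles` are hypothetical.

```python
import re

KEYS = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]

def field(key, required=False):
    """Pattern for one line such as [Age: "13"], captured in a group named after the key."""
    part = r'\[%s: "(?P<%s>[^"]*)"\]\s*' % (key, key)
    return part if required else "(?:%s)?" % part

# Name is the only mandatory field; every other field is wrapped in an optional group.
RECORD = re.compile("".join(field(k, required=(k == "Name")) for k in KEYS))

def parse_profiles(text, placeholder=""):
    rows = []
    for m in RECORD.finditer(text):
        # Groups that did not take part in the match are None; swap in the placeholder.
        rows.append(tuple(placeholder if m.group(k) is None else m.group(k) for k in KEYS))
    return rows

sample = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]'''

for row in parse_profiles(sample):
    print(row)
```

Each match covers one whole profile, so the tuples always have 8 entries, and passing `placeholder=float("nan")` gives NaN-style gaps instead of empty strings.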
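You're right, all of those places should have colons. I've fixed it. I put James last to show profiles with different numbers of variables.
You can read the file line by line and collect data until you reach a blank line. Then add the collected data to a list.
@WiktorStribiżew Actually, that's a good idea.
Your function does exactly what I wanted, but could you briefly explain how group_results works?
@Qwertford group_results creates a list for each user in the input. Specifically, it appends the values from the loop to the running group (_group), but if a Name value is found and a Name already exists in the group, the current group is yielded and reassigned.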
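The grouping rule described in that last comment can be condensed into a few lines. A minimal sketch, assuming the input is a flat list of (key, value) pairs in file order; `split_profiles` is a hypothetical name, not the answer's own function.

```python
def split_profiles(pairs):
    """Yield one list of (key, value) pairs per person."""
    group = []
    for key, value in pairs:
        has_name = any(k == "Name" for k, _ in group)
        # A Surname or a Name seen after the group already has a Name
        # means the previous profile is complete.
        if has_name and key in ("Surname", "Name"):
            yield group
            group = []
        group.append((key, value))
    if group:
        yield group

pairs = [("Surname", "Gordon"), ("Name", "James"), ("Age", "13"),
         ("Name", "Monica"), ("Weight", "33")]
for profile in split_profiles(pairs):
    print(dict(profile))
```

The rule only works because Name is guaranteed for every person; without that guarantee there would be no reliable boundary between profiles.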