Python: inserting values into a dictionary using a regex whose pattern contains the key


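I am trying to extract data from a PDF file, so I read every line of the converted text file into a list. I have a predefined list that will serve as the keys. I want to build a dictionary whose keys come from that predefined list and extract the corresponding values. For example, the file contains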

Name  : Luke Cameron 
Age and Sex : 37/Male
Haemoglobin       13.0            g/dL

I have a predefined list, for example

keys=['Name', 'Age', 'Sex']

My code is

for text in lines:
    rx_dict = {elem:re.search(str(elem)+r':\s+\w+.\s\w+',text) for elem in keys}

Output:

{'Patient Name': None,
 'Age': None,
 'Sex': None
}

Expected output:

{'Patient Name': Luke Cameron,
 'Age': 37,
 'Sex': Male
}

Note: this is not real data; any resemblance is coincidental.

strings_from_pdf = ["Name  : Luke Cameron", "Age and Sex : 37/Male", "Haemoglobin  13.0  g/dL"]
keys = ['Name', 'Age', 'Sex']

def findKeys(keys):
    result = {}  # avoid shadowing the built-in dict
    for line in strings_from_pdf:
        if keys[0] in line:
            # "Name  : Luke Cameron" -> value after the colon
            _, name = line.split(":")
            result['Patient Name'] = name.strip()
        if keys[1] in line:
            # "Age and Sex : 37/Male" -> two values separated by "/"
            _, age_and_gender = line.split(":")
            age, gender = age_and_gender.split("/")
            result['Age'] = age.strip()
            result['Sex'] = gender.strip()
    return result

result = findKeys(keys)

import re

data = """
Name  : Luke Cameron 
Age and Sex : 37/Male
Haemoglobin       13.0            g/dL"""

rx = re.compile(r'^(?P<key>[^:\n]+):(?P<value>.+)', re.M)

result = {}
for match in rx.finditer(data):
    key = match.group('key').rstrip()
    value = match.group('value').strip()
    try:
        key1, key2 = key.split(" and ")
        value1, value2 = value.split("/")
        result.update({key1: value1, key2: value2})
    except ValueError:
        result.update({key: value})

print(result)
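
As a side note on why the pattern in the question returns None: it concatenates the key with `:\s+`, so the colon has to follow the key immediately, whereas the sample lines have spaces before the colon (`Name  :`). The dictionary is also rebuilt for every line, so only the result for the last line survives. A minimal sketch of the difference, where the `relaxed` pattern is an illustrative assumption and not part of the original answers:

```python
import re

line = "Name  : Luke Cameron"

# Pattern built the way the question builds it: the colon must directly
# follow the key, so "Name  :" (spaces before the colon) cannot match.
strict = re.search("Name" + r":\s+\w+.\s\w+", line)

# Hypothetical relaxed variant: escape the key and allow optional
# whitespace on both sides of the colon, then capture the rest of the line.
relaxed = re.search(re.escape("Name") + r"\s*:\s*(.+)", line)

print(strict)            # None
print(relaxed.group(1))  # Luke Cameron
```

This only covers a line with a single key; combined lines such as `Age and Sex : 37/Male` still need the splitting shown in the answer above.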

Here is a non-regex approach:

txt = """\
Name  : Luke Cameron 
Age and Sex : 37/Male
Haemoglobin       13.0            g/dL"""

keys=('Patient Name','Age','Sex')

ans={}
for t in (line.partition(':') for line in txt.splitlines() if line.partition(':')[2]):
    if sum(n in t[0] for n in keys)>1:
        ans.update(
           {k.strip():v.strip() for k,v in zip(t[0].split(' and '), t[2].split('/'))})
    else:
        ans[t[0].strip()]=t[2].strip()

>>> ans
{'Name': 'Luke Cameron', 'Age': '37', 'Sex': 'Male'}

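Thanks. My file contains unwanted spaces and indentation, so I will have to handle that as well, but this is a nice solution. I am looking for something that generalizes well, since I have about 400 keys. Thanks for your answer.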
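
One possible way to address both points in this comment (stray whitespace and a large key list) is sketched below. The function name `extract_fields` and the assumptions that combined labels are joined by " and " and combined values by "/" are illustrative, taken only from the sample lines in the question; real PDF output may need different separators.

```python
def extract_fields(lines, keys):
    """Return {key: value} for every wanted key found on a 'label : value' line."""
    wanted = set(keys)  # set lookup stays cheap even with hundreds of keys
    result = {}
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line, e.g. the Haemoglobin row
        # Strip indentation and collapse runs of whitespace.
        label = " ".join(label.split())
        value = " ".join(value.split())
        labels = label.split(" and ")
        values = [v.strip() for v in value.split("/")]
        # Only treat the line as combined when labels and values pair up;
        # otherwise keep it as one label with one value (e.g. a date like 01/02/2020).
        if len(labels) != len(values):
            labels, values = [label], [value]
        for k, v in zip(labels, values):
            if k in wanted:
                result[k] = v
    return result


lines = [
    "   Name  :   Luke Cameron ",
    "\tAge  and Sex : 37 / Male",
    "Haemoglobin       13.0            g/dL",
]
fields = extract_fields(lines, ["Name", "Age", "Sex"])
print(fields)  # {'Name': 'Luke Cameron', 'Age': '37', 'Sex': 'Male'}
```

Rows without a colon, such as the Haemoglobin line, are skipped here and would need a separate rule.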
