Parsing raw text data in Python and extracting specific values

python python-3.x parsing text

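A column in my database stores text in the format shown below. The text is not in a standard format, and sometimes there is additional text before the "Insurance Date" field. When I split the text in Python, "Insurance Date" can therefore end up in a different column from row to row. In that case I need to search all columns for the "Insurance Date" value.

Sample text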

"Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.\n
Insurance Date 12/17/2018\n
Insurance Number 235845\n
Carrier Name SKGP\n
Coverage $240000"

Expected result

INS_NO     Insurance Date     Carrier Name
235845    12/17/2018          SKGP

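How can we parse raw text like this and extract the value of Insurance Date?

I am using the logic below to split the text, but I don't know how to extract the date into a separate column.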

import pandas as pd

df = pd.read_sql(query, conn)  # query and conn are defined elsewhere
df2 = df["NOTES"].str.split("\n", expand=True)  # one column per line of text

If I understand you correctly, this may get you close to what you need:

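import pandas as pd

insurance = """
"Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.\n
Insurance Date 12/17/2018\n
Insurance Number 235845\n
Carrier Name SKGP\n
Coverage $240000"
"""

items = insurance.split('\n')
# Drop the empty strings produced by the doubled newlines
filtered_items = list(filter(lambda x: x != "", items))
del filtered_items[0]   # drop the free-text description line
del filtered_items[-1]  # drop the Coverage line
row = []
for item in filtered_items:
    # Keep only the last space-separated token, i.e. the value
    row.append(item.split(' ')[-1])

# The values come out in the order they appear in the text
columns = ["Insurance Date", "INS_NO", "Carrier Name"]
df = pd.DataFrame([row], columns=columns)
# Reorder to match the expected result
df = df[["INS_NO", "Insurance Date", "Carrier Name"]]
df

Output:

    INS_NO  Insurance Date  Carrier Name
0   235845  12/17/2018      SKGP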
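
The positional approach above assumes the description is always the first line and Coverage is always the last. Since the question says extra text may appear before the Insurance Date field, looking each field up by its label may be more robust. The following is only a minimal sketch, assuming every field sits on its own line and the line starts with its label; the helper name extract_fields and the LABELS mapping are made up for illustration, and the extra leading lines in the sample note are invented to show the point.

```python
# Hypothetical mapping: output column name -> label that starts the line.
LABELS = {
    "INS_NO": "Insurance Number",
    "Insurance Date": "Insurance Date",
    "Carrier Name": "Carrier Name",
}

def extract_fields(note):
    """Return {column: value} for every labelled line found in the note."""
    found = {}
    for line in note.splitlines():
        # Remove surrounding whitespace and any stray double quotes
        line = line.strip().strip('"')
        for column, label in LABELS.items():
            if line.startswith(label):
                # Everything after the label is treated as the value
                found[column] = line[len(label):].strip()
    return found

# Invented sample: two unexpected lines precede the labelled fields.
note = (
    "Some free text that may or may not be present.\n"
    "Another unexpected line.\n"
    "Insurance Date 12/17/2018\n"
    "Insurance Number 235845\n"
    "Carrier Name SKGP\n"
    "Coverage $240000"
)
fields = extract_fields(note)
print(fields)
```

The resulting dictionary can be passed to pd.DataFrame([fields], columns=list(LABELS)) to get the columns in the expected order. Fields that are missing from a note are simply absent from the dictionary, so they would show up as NaN in the DataFrame.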

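Use a regular expression

If the text follows a pattern (more or less), you can use a regex.

For regular expression operations, see the Python documentation for the re module.

Example

Below is a simplified example.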

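import re

text = """
Accumulation Period - period of time insured must incur eligible medical expenses at least equal to the deductible amount in order to establish a benefit period under a major medical expense or comprehensive medical expense policy.
Insurance Date 12/17/2018
Insurance Number 235845
Carrier Name SKGP
Coverage $240000
"""
# Each (.*) captures the rest of the line after its label
pattern = re.compile(r"Insurance Date (.*)\nInsurance Number (.*)\nCarrier Name (.*)\n")
match = pattern.search(text)
print("Found:")
if match:
    for g in match.groups():
        print(g)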

Output

Found:
12/17/2018
235845
SKGP
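
To get these values into separate columns of the DataFrame from the question, the same idea can be applied to the whole NOTES column with pandas' Series.str.extract. This is a sketch rather than a definitive solution: the small DataFrame below stands in for the result of pd.read_sql, its second row is invented to show fields in a different order, and the patterns assume a numeric insurance number, an MM/DD/YYYY date, and a carrier name without spaces.

```python
import pandas as pd

# Stand-in for pd.read_sql(query, conn); the NOTES column name comes from the question.
# The second note is invented sample data with the fields in a different order.
df = pd.DataFrame({
    "NOTES": [
        "Accumulation Period - some description.\n"
        "Insurance Date 12/17/2018\n"
        "Insurance Number 235845\n"
        "Carrier Name SKGP\n"
        "Coverage $240000",
        "Insurance Number 111111\n"
        "Unexpected extra line\n"
        "Insurance Date 01/02/2019\n"
        "Carrier Name ABCD",
    ]
})

# One pattern per field, so a field may appear on any line and in any order.
# Assumptions: numeric insurance number, MM/DD/YYYY date, single-word carrier name.
patterns = {
    "INS_NO": r"Insurance Number\s+(\d+)",
    "Insurance Date": r"Insurance Date\s+(\d{1,2}/\d{1,2}/\d{4})",
    "Carrier Name": r"Carrier Name\s+(\S+)",
}

# expand=False returns a Series for a single capture group;
# rows where the pattern is not found get NaN.
result = pd.DataFrame({
    column: df["NOTES"].str.extract(pattern, expand=False)
    for column, pattern in patterns.items()
})
print(result)
```

Because each field is extracted independently, there is no need to split on newlines first or to search every resulting column for "Insurance Date".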

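If it is always in this format, using a regular expression may be easier.

"I don't know how to extract the date into a separate column": could you be more specific? Please provide an example.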