Python: parse text data by condition and align it into columns


python pandas


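I have the text data below and need to parse it and split it into columns according to the following conditions:

Anything that starts with "=" should go into ENC_NAME.

For any line containing "BladeSystem", the number at the end of the line should go into OA_VERSION.

Any line containing "1 HP" should go under the VC_ACTIVE column.

Any line containing "2 HP" should go under the VC_STDN column.

Text data:

Expected output (sample):

Edit (what I tried):

Any help or ideas would be much appreciated.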

As suggested in the comments, opening the file with pandas and parsing it there is not ideal. Assuming your data is saved in the text file file.txt:

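import pandas as pd

with open("file.txt") as file:
    lines = [l.rstrip("\n") for l in file]

row_temp = [None] * 4
row = None
out = []
for line in lines:
    if line.startswith("="):
        # A new header starts a new row; store the previous one first
        if row is not None:
            out.append(row)
        row = row_temp.copy()
        row[0] = line.replace("=", "").strip()
    # split() without arguments ignores trailing spaces on the line
    if "BladeSystem" in line:
        row[1] = line.split()[-1]
    if "1 HP" in line:
        row[2] = line.split()[-1]
    if "2 HP" in line:
        row[3] = line.split()[-1]

# The last block is not followed by another header, so append it here
if row is not None:
    out.append(row)

col_names = ["ENC_NAME", "OA_VERSION", "VC_ACTIVE", "VC_STDN"]
df = pd.DataFrame(out, columns=col_names)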

This returns the output you are looking for.
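To check the line-by-line idea on a small input, including the final block that has no header after it, here is a minimal pandas-free sketch. The sample lines and the function name parse_lines are made up for illustration, since the real file is not shown here; only the markers "=", "BladeSystem", "1 HP" and "2 HP" come from the question.

```python
# Hypothetical sample lines; the real file contents are an assumption.
SAMPLE_LINES = [
    "=== enc1001 ===",
    "Onboard Administrator BladeSystem 4.85   ",  # trailing spaces on purpose
    "1 HP VC Module 4.50",
    "2 HP VC Module 4.50",
    "=== enc1002 ===",
    "Onboard Administrator BladeSystem 4.85",
]


def parse_lines(lines):
    """Collect one [ENC_NAME, OA_VERSION, VC_ACTIVE, VC_STDN] row per block."""
    rows = []
    row = None
    for line in lines:
        if line.startswith("="):
            # A header closes the previous block and opens a new one.
            if row is not None:
                rows.append(row)
            row = [line.strip("= \t"), None, None, None]
        elif row is None:
            # Ignore anything that appears before the first header.
            continue
        elif "BladeSystem" in line:
            row[1] = line.split()[-1]
        elif "1 HP" in line:
            row[2] = line.split()[-1]
        elif "2 HP" in line:
            row[3] = line.split()[-1]
    # Flush the last block, which is not followed by another header.
    if row is not None:
        rows.append(row)
    return rows


rows = parse_lines(SAMPLE_LINES)
print(rows)
```

Blocks with missing lines simply keep None in the corresponding position, and the resulting list can be passed to pd.DataFrame together with the four column names.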

You can try the following:

import pandas as pd
import re
import numpy as np

with open(r'test1.txt','r') as file:
    txto=file.read()

data=[]
pattern1 = re.compile(r'(^\=.+)\s.+$\n?', re.MULTILINE)
lstlines=txto.split('\n')

for ele1, ele2 in zip(re.findall(pattern1,txto),re.findall(pattern1,txto)[1:]):
    row=lstlines[lstlines.index(ele1):lstlines.index(ele2)]

    OA_VERSION=[i for i in row if 'BladeSystem' in i]
    OA_VERSION=OA_VERSION[0].split()[-1] if len(OA_VERSION)>0 else np.nan
    
    VC_ACTIVE=[i for i in row if '1 HP' in i]
    VC_ACTIVE=VC_ACTIVE[0].split()[-1] if len(VC_ACTIVE)>0 else np.nan
    
    VC_STDN=[i for i in row if '2 HP' in i]
    VC_STDN=VC_STDN[0].split()[-1] if len(VC_STDN)>0 else np.nan
    
    data.append([ele1.replace('=','').strip(),OA_VERSION, VC_ACTIVE,VC_STDN])
    
#last row 
row=lstlines[lstlines.index(re.findall(pattern1,txto)[-1]):]
OA_VERSION=[i for i in row if 'BladeSystem' in i]
OA_VERSION=OA_VERSION[0].split()[-1] if len(OA_VERSION)>0 else np.nan
VC_ACTIVE=[i for i in row if '1 HP' in i]
VC_ACTIVE=VC_ACTIVE[0].split()[-1] if len(VC_ACTIVE)>0 else np.nan
VC_STDN=[i for i in row if '2 HP' in i]
VC_STDN=VC_STDN[0].split()[-1] if len(VC_STDN)>0 else np.nan
data.append([re.findall(pattern1,txto)[-1].replace('=','').strip(),OA_VERSION, VC_ACTIVE,VC_STDN]) 

#Create dataframe
df=pd.DataFrame(data, columns=['ENC_NAME','OA_VERSION','VC_ACTIVE','VC_STDN'])
print(df)

Output:

df
   ENC_NAME  OA_VERSION VC_ACTIVE VC_STDN
0    enc1001       4.85      4.50    4.50
1    enc1002       4.85      4.50    4.50
2    enc1003       4.85      4.50    4.50
3    enc1004       4.85      4.50    4.50
4    enc1005       4.85      4.50    4.50
..       ...        ...       ...     ...
94   enc8025       4.85      4.62    4.62
95   enc8026       4.85      4.62    4.62
96   enc8027       4.85      4.62    4.62
97   enc8028       4.85      4.62    4.62
98   enc8033       4.85      4.40    4.40

[99 rows x 4 columns]
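As a variation on the answer above, the blocks can also be cut out by the positions of the header matches instead of looking lines up with list.index, which removes the need for a separate "last row" step. This is only a sketch under assumptions: the header layout ("=" signs around a single word), the sample text, and the names HEADER, last_field and parse_text are all hypothetical.

```python
import re

# Assumed header layout: "=" signs, a single word, "=" signs, on one line.
HEADER = re.compile(r"^=+[ \t]*(\w+)[ \t]*=+[ \t]*$", re.MULTILINE)


def last_field(block, marker):
    """Last whitespace-separated token of the first line containing marker."""
    for line in block.splitlines():
        if marker in line:
            return line.split()[-1]
    return None


def parse_text(text):
    headers = list(HEADER.finditer(text))
    # Each block runs from the end of its header to the start of the next
    # header, or to the end of the text for the final block.
    ends = [m.start() for m in headers[1:]] + [len(text)]
    records = []
    for match, end in zip(headers, ends):
        block = text[match.end():end]
        records.append({
            "ENC_NAME": match.group(1),
            "OA_VERSION": last_field(block, "BladeSystem"),
            "VC_ACTIVE": last_field(block, "1 HP"),
            "VC_STDN": last_field(block, "2 HP"),
        })
    return records


# Made-up sample text, not the data from the question.
SAMPLE_TEXT = "\n".join([
    "=== enc1001 ===",
    "Onboard Administrator BladeSystem 4.85",
    "1 HP VC Module 4.50",
    "2 HP VC Module 4.50",
    "=== enc1002 ===",
    "Onboard Administrator BladeSystem 4.85",
])

records = parse_text(SAMPLE_TEXT)
print(records)
```

The list of dictionaries can be handed to pd.DataFrame directly; missing values show up as None rather than np.nan in this sketch.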

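In my opinion, you should use a hand-written parser. What you have can be seen as a form of a so-called DSL, a domain-specific language. The grammar used here is fairly forgiving: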

import re, pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

class ENCVisitor(NodeVisitor):
    grammar = Grammar(r"""
            content     = (ws / block)*

            block       = header oa_line vc_active? vc_stdn?
            header      = delim ws word ws delim nl

            oa_line     = ~"^(?=.*BladeSystem).+"m nl?
            vc_active   = ~"^(?=.*1 HP).+"m nl?
            vc_stdn     = ~"^(?=.*2 HP).+"m nl?

            word        = ~"\w+"
            delim       = ~"=+"
            ws          = ~"\s+"
            nl          = ~"[\n\r]+"
    """)

    version_pattern = re.compile(r"\d+\.\d+$")

    def get_version(self, key, line):
        match = self.version_pattern.search(line)
        value = match.group(0) if match else None
        return {key: value}

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_header(self, node, visited_children):
        header = visited_children[2]
        return {"ENC_NAME": header.text}

    def visit_oa_line(self, node, visited_children):
        line, _ = visited_children
        return self.get_version("OA_VERSION", line.text)

    def visit_vc_active(self, node, visited_children):
        line, _ = visited_children
        return self.get_version("VC_ACTIVE", line.text)

    def visit_vc_stdn(self, node, visited_children):
        line, _ = visited_children
        return self.get_version("VC_STDN", line.text)

    def visit_block(self, node, visited_children):
        dct = {}
        for child in visited_children:
            if isinstance(child, dict):
                dct.update(child)
            elif isinstance(child, list):
                dct.update(child[0])
        return dct

    def visit_content(self, node, visited_children):
        return [child[0] for child in visited_children if isinstance(child[0], dict)]

# `data` must hold the raw text of the input file as a single string
enc = ENCVisitor()
result = enc.parse(data)

df = pd.DataFrame(result)
print(df)

For the data you provided, this yields

   ENC_NAME OA_VERSION VC_ACTIVE VC_STDN
0   enc1001       4.85      4.50    4.50
1   enc1002       4.85      4.50    4.50
2   enc1003       4.85      4.50    4.50
3   enc1004       4.85      4.50    4.50
4   enc1005       4.85      4.50    4.50
..      ...        ...       ...     ...
94  enc8025       4.85      4.62    4.62
95  enc8026       4.85      4.62    4.62
96  enc8027       4.85      4.62    4.62
97  enc8028       4.85      4.62    4.62
98  enc8033       4.85      4.40    4.40

[99 rows x 4 columns]

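Explanation: your input can be seen as a small language of its own, a so-called domain-specific language. Each block of information in the file consists of a header line, an OA_VERSION line, and two lines that may or may not be present (VC_ACTIVE and VC_STDN). The header line always starts and ends with "==".

All of these blocks together make up a grammar: the file/string is whitespace or blocks, repeated any number of times. Internally, an abstract syntax tree (AST) is built, and to retrieve the information we need to "visit" each node. In the parser library I chose (the excellent parsimonious), this is done with the NodeVisitor class, where every leaf of the AST is visited through a correspondingly named function. This means that if we call a part "header", the function should be named "visit_header".

The result is collected in "visit_block" and is a dictionary of all the information retrieved for that block. Finally, everything is fed into pandas.

Of course, this can only be a brief introduction; if you want to learn more about parsimonious, have a look at it.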
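The naming convention described above ("header" is handled by "visit_header") can be illustrated with a few lines of plain Python. This is only a sketch of name-based dispatch, not parsimonious's actual implementation; the classes Node, Visitor and HeaderVisitor are hypothetical stand-ins written for this example.

```python
class Node:
    """Tiny stand-in for a parse-tree node: rule name, matched text, children."""

    def __init__(self, name, text="", children=()):
        self.name = name
        self.text = text
        self.children = list(children)


class Visitor:
    def visit(self, node):
        # Visit the children first, then dispatch on the node's rule name.
        visited = [self.visit(child) for child in node.children]
        method = getattr(self, "visit_" + node.name, self.generic_visit)
        return method(node, visited)

    def generic_visit(self, node, visited_children):
        # Fallback for rules without a dedicated visit_<name> method.
        return visited_children or node


class HeaderVisitor(Visitor):
    def visit_word(self, node, visited_children):
        return node.text

    def visit_header(self, node, visited_children):
        # Children are: delim, word, delim. Only the word is kept.
        return {"ENC_NAME": visited_children[1]}


# A hand-built tree for one header line (normally the parser produces this).
tree = Node("header", "== enc1001 ==", [
    Node("delim", "=="),
    Node("word", "enc1001"),
    Node("delim", "=="),
])

result = HeaderVisitor().visit(tree)
print(result)
```

Rules without their own method fall through to generic_visit, which is why the visitor in the answer only defines methods for the parts it actually needs.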

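Use a parser.

@Jan, thanks for your input. Do you mean creating a Python parser?

Thanks a lot @rpanai, I will try it. +1

If I apply it to a large data set, it does not print the values for the OA_VERSION column. I have just added the text data set to the post.

I found that spaces were causing the problem and fixed that, but now it does not pick up the last block of data, i.e. enc8033.

My bad, for the last data block you should add out.append(row) outside the loop.

Thanks a lot @MrNobody33, I will try it. +1

Could you please edit it so that it reads the data from a file?

Same problem here: it works fine on the provided test data set, but for a larger data set of the same type it gives wrong results. I have just uploaded the actual data set.

Jan, thank you very much for writing your version of the solution, +1. And yes, I would appreciate it if you could explain it.

Thank you very much for the detailed explanation, have a nice day.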
