Python: Parsing structured but non-tabular text into pandas with regular expressions

python regex pandas parsing

I am trying to parse the following data from a text file into pandas:

genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin

I would like the output to look like this:

genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin

I have adapted some code (written by Vipin Ajayakumar).

When I run this code, it keeps running and never finishes.

Any tips on how to fix or troubleshoot this would be appreciated.

Have you tried setting a breakpoint inside the while loop and using a debugger to see what is happening?

You can simply use:

breakpoint()

with Python >= 3.7. For older versions:

import pdb

# your code

# for each part you are
# interested in the while 
# loop:
pdb.set_trace()

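Then run the script with the debugger enabled (running the script normally also stops at breakpoint() or pdb.set_trace()):

$ python3 -m pdb yourscript.py

Use "c" to continue to the next breakpoint. For details on the commands, see the pdb documentation.

If you use an IDE with an integrated debugger, you can use that instead, which is less cumbersome.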

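Incidentally, the likely cause is that you use

while line

and then never read a new line inside the loop. As long as the first line is not an empty string, the condition stays true and the while loop never exits. Try iterating over the file with a for loop instead.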

Your problem is here, where the file is read:

with open(filepath, 'r') as file_object:        
    line = file_object.readline()        
    while line:

The value of line never changes, so the while loop runs forever.

Change it to:

with open(filepath, 'r') as file_object: 
    lines = file_object.readlines()
    for line in lines:
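If you would rather keep the while loop, the missing piece is a second readline() call at the end of the loop body. The following is a minimal sketch of that pattern, not the original code; the function name read_with_while is hypothetical and io.StringIO stands in for the real file. Note that a blank line inside the file is read as "\n", which is still truthy, so only the empty string at end of file stops the loop.

```python
import io


def read_with_while(file_object):
    """Collect every line of file_object using a while loop (hypothetical helper)."""
    seen = []
    line = file_object.readline()
    while line:
        seen.append(line.rstrip("\n"))
        # Advance to the next line; without this the loop never terminates.
        line = file_object.readline()
    return seen


# io.StringIO is only a stand-in for an opened text file.
lines = read_with_while(io.StringIO("genome A\n\nreference B\n"))
print(lines)
```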

Using only pandas, we can use str.split:

df = pd.read_csv('tmp.txt',sep='|',header=None)
s = df[0].str.split(' ',expand=True)

df_new = s.set_index([0,s.groupby(0).cumcount()]).unstack(0)

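Thanks, that makes sense and it fixed the infinite loop. However, the code now produces an empty DataFrame, which I had already run into while troubleshooting. I will update the question.

I have updated the question here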

Iterating over the file with a for loop, as suggested above, looks like this:

with open('file.suffix', 'r') as fileobj:
    for line in fileobj:
        # your logic
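One way to fill in the "# your logic" placeholder is a small regex-driven state machine. The sketch below is an illustration, not the code from the question: the names LINE_RE and parse_records are hypothetical, and it assumes that every record ends with a source line and that each value is a single token without spaces.

```python
import re

# Matches "<key> <value>" where key is one of the three known fields.
LINE_RE = re.compile(r"^(genome|reference|source)\s+(\S+)\s*$")


def parse_records(lines):
    """Return one dict per (genome, reference, source) record."""
    rows = []
    genome = None
    reference = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank or unrecognised line
        key, value = match.groups()
        if key == "genome":
            genome = value  # remembered until the next genome line
        elif key == "reference":
            reference = value
        else:
            # A source line completes the current record.
            rows.append({"genome": genome, "reference": reference, "source": value})
    return rows


sample = """\
genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

rows = parse_records(sample.splitlines())
for row in rows:
    print(row)
```

With a real file, the open file object can be passed in place of sample.splitlines(), and pd.DataFrame(rows) should turn the resulting list into a DataFrame with genome, reference and source columns.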

Printing the result of the pandas-only approach gives:

print(df_new)

                 1                      
0           genome reference      source
0  Bacteroidetes_4      B650  carotenoid
1  Desulfovibrio_3      B123  Polyketide
2              NaN      B839  flexirubin
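The printed result leaves NaN in the genome column for the third row, because cumcount pairs values by position rather than by the genome they belong to. The following sketch is a possible variant, not the answer's own code: it forward-fills the genome name and numbers the records by their reference lines. The column names key, value and record are hypothetical, it assumes each reference line is followed by exactly one source line, and passing a list as the pivot index assumes pandas 1.1 or later.

```python
import pandas as pd

sample = """\
genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

# One row per non-blank line, split into a key and a value.
lines = pd.Series(sample.splitlines())
lines = lines[lines.str.strip() != ""]
kv = lines.str.split(n=1, expand=True)
kv.columns = ["key", "value"]

# Carry each genome name down to the lines that follow it.
kv["genome"] = kv["value"].where(kv["key"] == "genome").ffill()

# Every reference line starts a new record; its source line shares the id.
rec = kv[kv["key"] != "genome"].copy()
rec["record"] = (rec["key"] == "reference").cumsum()

# Spread reference/source into columns, one row per record.
out = rec.pivot(index=["record", "genome"], columns="key", values="value").reset_index()
out.columns.name = None
out = out[["genome", "reference", "source"]]

print(out.to_csv(index=False))
```

Putting record first in the pivot index keeps the rows in file order instead of sorting them by genome name.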