Python: building a DataFrame from a txt file

Tags: python, pandas, dataframe

I want to extract some information from a txt file (named inf.txt) to build a DataFrame in Python. An example of inf.txt is as follows:

bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3
The desired result is:

    1   2   3   4   5   6
0   43  na  na  0   4   na
1   na  na  na  na  na  na
2   4   5   na  95  na  94
3   na  na  na  140 na  142
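The row number comes from the number after "dtype: object". The column number comes from the digit attached to the name inside each pair of parentheses, and the cell value is the second item in the parentheses.

For example, the first line is (Variable1, 43): it belongs to dtype: object 0, so it goes in row 0; it is Variable1, so it goes in the first column.

Another example: the second-to-last line is (Label6, 142): it belongs to dtype: object 3, so it goes in row 3; it is Label6, so it goes in the sixth column.

All the other strings, such as "bene_id_18900", "Variable", "Label", etc., carry no meaning.

My idea is to append the corresponding row number to each parenthesized pair, so that afterwards I can keep all the useful information and discard the rest. Like this: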

(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......
My attempt so far; I really don't know how to continue:

with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
for x in content:
    pass  # stuck here: not sure how to process each line

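Assume you know the number of rows (M) and columns (N) that the text file calls for. A simple parse for the largest dtype number and the largest Label(no)/Variable(no) number gives you this: M is the largest dtype number plus one, and N is the largest Label/Variable number. Next, create an M x N array: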

import re
import pandas as pd
# assuming that you have found the max no of rows M and max no of columns N.
M = 4
N = 6
# create an MxN list of lists filled with 'na'
# (data = [['na'] * N] * M would not work: lists are mutable, so every row
# would be the same list object)
data = [['na'] * N for _ in range(M)]
index_x = -999  # sentinel so index_x exists before the first dtype line (fixes the NameError)

with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # get the x axis index
            index_x = int(line.split(' ')[-1])
        if 'Label' in line:
            # get y axis index
            # raw string avoids invalid-escape warnings; \d+ also allows column numbers above 9
            c = re.search(r'Label(\d+), (\d+)', line)
            index_y = int(c.groups()[0])
            # reduce index_y by 1 as the col names start with 1 and python list is 0 index
            if index_y > 0:
                index_y -= 1
            # get value
            value = int(c.groups()[1])
            if index_x >= 0: # fix the NameError and a logical bug
                # populate the correct x,y location in the list of lists
                data[index_x][index_y] = value
        if 'Variable' in line:
            # same pattern as for Label lines; VNAME lines never reach this branch
            c = re.search(r'Variable(\d+), (\d+)', line)
            index_y = int(c.groups()[0])
            value = int(c.groups()[1])
            if index_y > 0:
                index_y -= 1
            if index_x >= 0: # fix the NameError and a logical bug
                data[index_x][index_y] = value
# create the col names
cols = range(1, N+1)
# create the dataframe
df = pd.DataFrame(data, columns=cols)
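
The code above hard-codes M and N. The first pass mentioned at the start of this answer could look roughly like the sketch below. The helper name find_shape and the in-memory sample list are assumptions made for illustration; they are not part of the original code.

```python
import re

# Hypothetical helper: scan the lines once and return
# (M, N) = (largest dtype number + 1, largest Variable/Label number).
def find_shape(lines):
    max_row = -1
    max_col = 0
    for line in lines:
        line = line.strip()
        if line.startswith('dtype'):
            # the row number is the last token of "dtype: object k"
            max_row = max(max_row, int(line.split()[-1]))
        else:
            m = re.search(r'\((?:Variable|Label)(\d+), (\d+)\)', line)
            if m:
                max_col = max(max_col, int(m.group(1)))
    return max_row + 1, max_col

# A few lines in the same shape as inf.txt
sample = [
    'bene_id_18900    (Variable1, 43)',
    'bene_id_18900    (Variable4, 0)',
    'dtype: object 0',
    'from      (Variable6, 94)',
    'from         (VNAME6, 94)',
    'dtype: object 2',
    'last day on claim billing statement     (Label6, 142)',
    'dtype: object 3',
]
M, N = find_shape(sample)
print(M, N)
```

For the real file, the open file handle can be passed in place of sample, since the function only iterates over lines.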
Hope this helps; it works for me. I used this as the sample:

dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3
The output is:

    1   2   3    4   5    6
0  43  na  na   95   4   94
1  na  na  na   na  na   na
2  na  na  na  140  na  142
3  na  na  na  140  na  142
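
One side note: the frame mixes integers with the string 'na', so every column ends up with object dtype. If real missing values are preferred, a conversion along these lines may help. This is only a sketch on a small stand-in frame, not on the actual data.

```python
import pandas as pd

# Small stand-in for the frame built above: integers mixed with 'na' strings.
df = pd.DataFrame(
    [[43, 'na', 95], ['na', 'na', 'na'], ['na', 'na', 140]],
    columns=[1, 2, 3],
)

# Convert each column to numbers; anything non-numeric (here 'na') becomes NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')
print(numeric)
print(numeric.isna().sum().sum())  # number of missing cells
```

After the conversion the columns are floating point, because NaN is a float value.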

Just FYI, I am assuming these are valid data too:

dtype: object 0
from      (Variable4, 95) # is valid
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94) # is valid
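
The assumption above is that a dtype line applies to the lines that follow it. The expected table in the question seems to suggest the opposite reading, where each dtype line closes the block above it. Under that reading, a sketch could collect the pairs first and assign the row when the dtype line arrives. The names PAIR and parse_trailing_dtype are made up for this illustration, and the shape 4 x 6 is taken from M and N above.

```python
import re
import pandas as pd

PAIR = re.compile(r'\((?:Variable|Label)(\d+), (\d+)\)')

def parse_trailing_dtype(lines):
    """Return (row, column, value) triples, treating each
    'dtype: object k' line as the end of the block above it."""
    triples = []
    pending = []  # (column, value) pairs still waiting for their dtype line
    for line in lines:
        line = line.strip()
        if line.startswith('dtype'):
            row = int(line.split()[-1])
            triples.extend((row, col, val) for col, val in pending)
            pending = []
        else:
            m = PAIR.search(line)  # VNAME lines do not match and are skipped
            if m:
                pending.append((int(m.group(1)), int(m.group(2))))
    return triples

sample = [
    'bene_id_18900    (Variable1, 43)',
    'bene_id_18900    (Variable4, 0)',
    'dtype: object 0',
    'encrypted 723 beneficiary id    (Label1, 43)',
    'encrypted 723 beneficiary id    (Label5, 4)',
    'dtype: object 0',
    'from      (Variable4, 95)',
    'from         (VNAME4, 95)',
    'from      (Variable6, 94)',
    'dtype: object 2',
    'thru     (Variable4, 140)',
    'thru     (Variable6, 142)',
    'dtype: object 3',
]

triples = parse_trailing_dtype(sample)
tidy = pd.DataFrame(triples, columns=['row', 'col', 'value'])
# If a cell appears more than once, keep the last value seen.
tidy = tidy.drop_duplicates(subset=['row', 'col'], keep='last')
wide = tidy.pivot(index='row', columns='col', values='value')
# Fill in the rows and columns that never appeared (they become NaN).
wide = wide.reindex(index=range(4), columns=range(1, 7))
print(wide)
```

The intermediate triples are close to the (column, value, row) tuples proposed in the question, only with the row placed first.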

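Thank you! I'm new to Python. I copied the code, but there is an error on line 39 saying NameError: name 'index_x' is not defined. Please tell me how to fix it. Thanks.

Hi Joe, please check now, it should work. I had not initialized index_x, hence the NameError (because dtype is not the first line).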