Python: building a DataFrame from a txt file

I want to extract some information from a txt file (named inf.txt) to build a DataFrame in Python. An example of inf.txt is as follows:

bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

The desired result is:

    1   2   3   4   5   6
0   43  na  na  0   4   na
1   na  na  na  na  na  na
2   4   5   na  95  na  94
3   na  na  na  140 na  142

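The row number comes from the number after dtype: object. The column number comes from the number at the end of the name inside each pair of parentheses, and the cell value is the second number in the parentheses.

For example, take the first line, (Variable1, 43): it belongs to dtype: object 0, so it goes in row 0; the name is Variable1, so it goes in column 1.

Another example is the second-to-last line, (Label6, 142): it belongs to dtype: object 3, so it goes in row 3; the name is Label6, so it goes in column 6.

All the strings such as "bene_id_18900", "Variable", "Label" and so on carry no meaning of their own.

My idea is to add the corresponding row number to each pair of parentheses, so that later I can keep all the useful information and drop everything else. Like this: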

(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......
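A minimal sketch of this idea, assuming the dtype line closes the group of entries above it (as described in the examples). The names `SAMPLE`, `ENTRY`, `DTYPE` and `to_triples` are hypothetical, and the inline string only stands in for a few lines of inf.txt. Entries are buffered until the next dtype line reveals their row number:

```python
import re

# Hypothetical inline sample standing in for a few lines of inf.txt
SAMPLE = """\
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
dtype: object 2
"""

# "(Variable4, 95)" -> column 4, value 95; the letters before the digit are ignored
ENTRY = re.compile(r"\([A-Za-z]+(\d+),\s*(\d+)\)")
DTYPE = re.compile(r"^dtype: object (\d+)$")

def to_triples(text):
    """Return (column, value, row) tuples in file order."""
    triples = []
    pending = []  # (column, value) pairs still waiting for their dtype line
    for line in text.splitlines():
        line = line.strip()
        m = DTYPE.match(line)
        if m:
            row = int(m.group(1))
            triples.extend((col, value, row) for col, value in pending)
            pending = []
            continue
        m = ENTRY.search(line)
        if m:
            pending.append((int(m.group(1)), int(m.group(2))))
    return triples

print(to_triples(SAMPLE))
```

Because the name prefix is ignored, a VNAME line produces the same tuple as the Variable line before it. Such duplicates are harmless if the tuples are later written into a table cell by cell.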

My attempt so far (I really don't know how to continue):

with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
for x in content:
    pass  # stuck here: not sure how to process each line

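Assume you know the number of rows (M) and columns (N) of the target table. A simple pass over the file that finds the largest dtype number (plus one, since rows start at 0) and the largest Label/Variable number gives you both. Then create an M x N list of lists and fill it in: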
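The pass for M and N mentioned above could look roughly like this. It is only a sketch: `infer_shape` and `SAMPLE` are hypothetical names, and it assumes row numbers start at 0 and column numbers start at 1, as in the question.

```python
import re

# Hypothetical inline sample standing in for a few lines of inf.txt
SAMPLE = """\
bene_id_18900    (Variable1, 43)
dtype: object 0
thru     (Variable4, 140)
thru        (VNAME4, 140)
dtype: object 3
last day on claim billing statement     (Label6, 142)
dtype: object 3
"""

def infer_shape(text):
    """Return (M, N): number of rows and columns implied by the text."""
    rows = [int(n) for n in re.findall(r"dtype: object (\d+)", text)]
    cols = [int(n) for n in re.findall(r"\((?:Variable|Label)(\d+),", text)]
    M = max(rows, default=-1) + 1  # rows are 0-based, so add one
    N = max(cols, default=0)       # columns are 1-based
    return M, N

M, N = infer_shape(SAMPLE)
print(M, N)
```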

import re
import pandas as pd
# assuming that you have found the max no of rows M and max no of columns N.
M = 4
N = 6
# create MxN list of lists with values 'na'
x = ['na'] * N
data = []
for i in range(M):
    tmp = list(x)
    data.append(tmp)
index_x = -999 # fix for NameError
# data = [x] * M; this does not work since lists are mutable objects

with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # get the x axis index
            index_x = int(line.split(' ')[-1])
        if 'Label' in line:
            # get y axis index
            c = re.search(r'Label(\d), (\d+)', line)  # raw string avoids invalid-escape warnings
            index_y = int(c.groups()[0])
            # reduce index_y by 1 as the col names start with 1 and python list is 0 index
            if index_y > 0:
                index_y -= 1
            # get value
            value = int(c.groups()[1])
            if index_x >= 0: # fix the NameError and a logical bug
                # populate the correct x,y location in the list of lists
                data[index_x][index_y] = value
        if 'Variable' in line:
            c = re.search(r'Variable(\d), (\d+)', line)  # raw string avoids invalid-escape warnings
            index_y = int(c.groups()[0])
            value = int(c.groups()[1])
            if index_y > 0:
                index_y -= 1
            if index_x >= 0: # fix the NameError and a logical bug
                data[index_x][index_y] = value
# create the col names
cols = range(1, N+1)
# create the dataframe
df = pd.DataFrame(data, columns=cols)

Hope this helps; it works for me. I used this as the sample:

dtype: object 0
encrypted 723 beneficiary id    (Label1, 43)
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
from      (Variable4, 95)
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94)
dtype: object 2
first day on claim billing statement      (Label4, 95)
first day on claim billing statement      (Label6, 94)
dtype: object 2
thru     (Variable4, 140)
thru        (VNAME4, 140)
thru     (Variable6, 142)
thru        (VNAME6, 142)
dtype: object 3
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3

The output is:

    1   2   3    4   5    6
0  43  na  na   95   4   94
1  na  na  na   na  na   na
2  na  na  na  140  na  142
3  na  na  na  140  na  142

Just FYI, I think these are also valid data:

dtype: object 0
from      (Variable4, 95) # is valid
from         (VNAME4, 95)
from      (Variable6, 94)
from         (VNAME6, 94) # is valid
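Note that the code above assigns each value to the most recent dtype line seen before it, which is why the sample starts with a dtype line. If the dtype line instead closes the group above it, as the question's examples suggest, one possible adjustment is to walk the lines in reverse so the dtype line is still seen first. The following is a sketch under that assumption; `SAMPLE` is a hypothetical inline stand-in for the file, M and N are assumed known, and NaN is used in place of the 'na' string.

```python
import re
import pandas as pd

# Hypothetical inline sample; here each dtype line follows its entries
SAMPLE = """\
bene_id_18900    (Variable1, 43)
bene_id_18900    (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id    (Label5, 4)
dtype: object 0
from      (Variable4, 95)
from      (Variable6, 94)
dtype: object 2
last day on claim billing statement     (Label4, 140)
last day on claim billing statement     (Label6, 142)
dtype: object 3
"""

M, N = 4, 6  # assumed known, as in the code above

# Start with an all-NaN table whose columns are labelled 1..N
df = pd.DataFrame(float("nan"), index=range(M), columns=range(1, N + 1))

row = None
for line in reversed(SAMPLE.splitlines()):
    line = line.strip()
    if line.startswith("dtype"):
        # In reverse order the dtype line comes before the entries it closes
        row = int(line.split()[-1])
        continue
    m = re.search(r"\((?:Variable|Label)(\d+),\s*(\d+)\)", line)
    if m and row is not None:
        df.at[row, int(m.group(1))] = int(m.group(2))

print(df)
```

With this sample, the values 95 and 94 end up in row 2 rather than row 0, and row 1 stays empty.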

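Thanks! I'm new to Python. I copied the code, but line 39 raises NameError: name 'index_x' is not defined. Could you tell me how to fix it? Thanks.

Hi Joe, please check again now; it should work. I had not initialized index_x, hence the NameError (because the dtype line is not the first line of the file).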