Python: building a DataFrame from a txt file
I want to extract some information from a txt file (named inf.txt) and use it to build a DataFrame in Python. An example of inf.txt looks like this:
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
first day on claim billing statement (Label4, 95)
first day on claim billing statement (Label6, 94)
dtype: object 2
thru (Variable4, 140)
thru (VNAME4, 140)
thru (Variable6, 142)
thru (VNAME6, 142)
dtype: object 3
last day on claim billing statement (Label4, 140)
last day on claim billing statement (Label6, 142)
dtype: object 3
The desired result is:
     1   2   3    4   5    6
0   43  na  na    0   4   na
1   na  na  na   na  na   na
2    4   5  na   95  na   94
3   na  na  na  140  na  142
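The row number comes from the number after dtype: object. The column number comes from the digit attached to Variable/Label inside each bracket, and the cell value is the second number in the bracket.
For example, the first line contains (Variable1, 43): it belongs to dtype: object 0, so it goes in row 0; it is Variable1, so it goes in column 1.
Another example: the second-to-last line contains (Label6, 142): it belongs to dtype: object 3, so it goes in row 3; it is Label6, so it goes in column 6.
All the other strings, such as "bene_id_18900", "Variable", "Label" and so on, carry no meaning.
My idea is to add the corresponding row number to each bracket, so that I can later keep all the useful information and drop everything else. Like this: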
(1, 43, 0)
(4, 0, 0)
(1, 43, 0)
(5, 4, 0)
(1, 43, 0)
(4, 0, 0)
(4, 95, 1)
(4, 95, 1)
......
......
......
My attempt so far; I really don't know how to continue:
with open('/Users/xccxken/Dropbox/inf.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
for x in content:
    ...  # unsure how to continue from here
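One way the loop above could be continued to produce the (column, value, row) triples described in the question is sketched below. It assumes that each dtype line closes the block of entries printed above it, and that VNAME lines only repeat the Variable entries and can be skipped. The names to_triples, ENTRY, DTYPE and sample are illustrative and are not part of the original post.

```python
import re

# Matches e.g. "(Variable4, 95)" or "(Label6, 142)". VNAME lines are not
# matched, on the assumption that they only repeat the Variable entries.
ENTRY = re.compile(r'\((?:Variable|Label)(\d+), (\d+)\)')
DTYPE = re.compile(r'dtype: object (\d+)')


def to_triples(lines):
    """Return (column, value, row) tuples, assuming a dtype line closes its block."""
    triples = []
    pending = []  # (column, value) pairs still waiting for their row number
    for line in lines:
        line = line.strip()
        entry = ENTRY.search(line)
        footer = DTYPE.search(line)
        if entry:
            pending.append((int(entry.group(1)), int(entry.group(2))))
        elif footer:
            row = int(footer.group(1))
            triples.extend((col, value, row) for col, value in pending)
            pending = []
    return triples


sample = [
    "bene_id_18900 (Variable1, 43)",
    "bene_id_18900 (Variable4, 0)",
    "dtype: object 0",
    "from (Variable4, 95)",
    "from (VNAME4, 95)",
    "dtype: object 2",
]
print(to_triples(sample))  # [(1, 43, 0), (4, 0, 0), (4, 95, 2)]
```

In this sketch the row is taken directly from the dtype line, so the "from" entry ends up with row 2.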
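Assume you know the number of rows (M) and the number of columns (N) implied by the text file. A simple parsing pass that finds the largest dtype number and the largest Label/Variable number will give you this information. Next, create an M x N array: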
import re
import pandas as pd

# assuming that you have found the max no of rows M and max no of columns N.
M = 4
N = 6

# create MxN list of lists with values 'na'
x = ['na'] * N
data = []
for i in range(M):
    tmp = list(x)
    data.append(tmp)
# data = [x] * M; this does not work since lists are mutable objects

index_x = -999  # fix for NameError

with open('/Users/xccxken/Dropbox/inf.txt') as fh:
    for line in fh:
        line = line.strip()
        if 'dtype' in line:
            # get the x axis index
            index_x = int(line.split(' ')[-1])
        if 'Label' in line:
            # get y axis index
            c = re.search(r'Label(\d), (\d+)', line)
            index_y = int(c.groups()[0])
            # reduce index_y by 1 as the col names start with 1 and python list is 0 index
            if index_y > 0:
                index_y -= 1
            # get value
            value = int(c.groups()[1])
            if index_x >= 0:  # fix the NameError and a logical bug
                # populate the correct x,y location in the list of lists
                data[index_x][index_y] = value
        if 'Variable' in line:
            c = re.search(r'Variable(\d), (\d+)', line)
            index_y = int(c.groups()[0])
            value = int(c.groups()[1])
            if index_y > 0:
                index_y -= 1
            if index_x >= 0:  # fix the NameError and a logical bug
                data[index_x][index_y] = value

# create the col names
cols = range(1, N+1)
# create the dataframe
df = pd.DataFrame(data, columns=cols)
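The code above hard-codes M and N. The "simple parsing pass" mentioned earlier could look roughly like the sketch below, which scans the lines once for the largest dtype number and the largest Variable/Label number. The function name infer_shape and the sample list are illustrative assumptions, not part of the original answer.

```python
import re


def infer_shape(lines):
    """Return (M, N): the number of rows and columns implied by the text."""
    max_row = -1
    max_col = 0
    for line in lines:
        footer = re.search(r'dtype: object (\d+)', line)
        entry = re.search(r'(?:Variable|Label)(\d+), \d+', line)
        if footer:
            max_row = max(max_row, int(footer.group(1)))
        if entry:
            max_col = max(max_col, int(entry.group(1)))
    # rows are numbered from 0, columns from 1
    return max_row + 1, max_col


sample = [
    "encrypted 723 beneficiary id (Label1, 43)",
    "encrypted 723 beneficiary id (Label5, 4)",
    "dtype: object 0",
    "thru (Variable6, 142)",
    "dtype: object 3",
]
M, N = infer_shape(sample)
print(M, N)  # 4 6
```

With a real file, the open file handle can be passed in directly, since the function only iterates over lines.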
Hope this helps; it works for me.
I used this as the sample:
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
first day on claim billing statement (Label4, 95)
first day on claim billing statement (Label6, 94)
dtype: object 2
thru (Variable4, 140)
thru (VNAME4, 140)
thru (Variable6, 142)
thru (VNAME6, 142)
dtype: object 3
last day on claim billing statement (Label4, 140)
last day on claim billing statement (Label6, 142)
dtype: object 3
The output is:
     1   2   3    4   5    6
0   43  na  na   95   4   94
1   na  na  na   na  na   na
2   na  na  na  140  na  142
3   na  na  na  140  na  142
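The output above differs from the table in the question because the loop applies each dtype line to the entries that follow it. If the dtype line is instead read as closing the block above it (an assumption, based on the examples given in the question), the entries can be buffered until their dtype line arrives. The sketch below shows that variant; fill_grid, ENTRY, FOOTER and the shortened sample text are illustrative names and data, not part of the original answer.

```python
import re
import pandas as pd

ENTRY = re.compile(r'(?:Variable|Label)(\d+), (\d+)')  # VNAME lines are not matched
FOOTER = re.compile(r'dtype: object (\d+)')


def fill_grid(lines, n_rows, n_cols):
    """Fill an n_rows x n_cols grid, treating each dtype line as a block footer."""
    grid = [['na'] * n_cols for _ in range(n_rows)]
    pending = []  # (column, value) pairs waiting for their dtype line
    for line in lines:
        entry = ENTRY.search(line)
        footer = FOOTER.search(line)
        if entry:
            pending.append((int(entry.group(1)), int(entry.group(2))))
        elif footer:
            row = int(footer.group(1))
            for col, value in pending:
                grid[row][col - 1] = value  # columns are numbered from 1
            pending = []
    return grid


text = """\
bene_id_18900 (Variable1, 43)
bene_id_18900 (Variable4, 0)
dtype: object 0
encrypted 723 beneficiary id (Label1, 43)
encrypted 723 beneficiary id (Label5, 4)
dtype: object 0
from (Variable4, 95)
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94)
dtype: object 2
thru (Variable4, 140)
thru (Variable6, 142)
dtype: object 3
"""
grid = fill_grid(text.splitlines(), 4, 6)
df = pd.DataFrame(grid, columns=list(range(1, 7)))
print(df)
```

On this sample, rows 0, 1 and 3 match the table in the question. Row 2 gets 95 and 94 in columns 4 and 6, while the 4 and 5 shown in columns 1 and 2 of the question's table are not produced, since no line in the posted excerpt accounts for them.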
Just FYI, I think these are also valid data:
dtype: object 0
from (Variable4, 95) # is valid
from (VNAME4, 95)
from (Variable6, 94)
from (VNAME6, 94) # is valid
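Thanks! I'm new to Python. I copied the code, but there is an error on line 39 saying NameError: name 'index_x' is not defined. Please tell me how to fix it. Thanks.
Hi Joe, please check again now, it should work. I had not initialized index_x, hence the NameError (because dtype is not on the first line).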