Read a CSV file in which one column holds a list, creating a new row per list item
I have a CSV file in the following format:
id results_numbers results creation_time
9680 2 [(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')] 11/10/14 0:23
9690 3 [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')] 12/10/14 0:10
I want to read this into pandas and convert it to the following:
id results_numbers new_id name creation_time
9680 2 9394 lesbyfaye 11/10/14 0:23
9680 2 999 Kayts & Koilsby 11/10/14 0:23
9690 3 5968 Jacobsonl 12/10/14 0:10
9690 3 47 SarHix 12/10/14 0:10
9690 3 8825 joy 12/10/14 0:10
Assuming you can read the data into a DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'id': [9680, 9690], 'results_number': [2, 3], 'results': [[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')], [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]], 'creation_time': ["11/10/14 0:23", "12/10/14 0:10"]})
>>> # Emit one output row per (new_id, name) tuple in each row's results list.
>>> pd.DataFrame([[row.id, row.results_number, tup[0], tup[1], row.creation_time]
...               for _, row in df.iterrows()
...               for tup in row.results],
...              columns=['id', 'results_numbers', 'new_id', 'name', 'creation_time'])
id results_numbers new_id name creation_time
0 9680 2 9394 lesbyfaye 11/10/14 0:23
1 9680 2 999 Kayts & Koilsby 11/10/14 0:23
2 9690 3 5968 Jacobsonl 12/10/14 0:10
3 9690 3 47 SarHix 12/10/14 0:10
4 9690 3 8825 joy 12/10/14 0:10
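The answer above assumes the results column already holds Python lists. When the file is loaded with pandas, that column usually arrives as plain strings, so it has to be parsed first. The following is only a sketch of one way to do that with ast.literal_eval; the frame named raw is a stand-in for whatever the CSV reader returns, and it assumes every string is a complete, well-formed literal.

```python
import ast

import pandas as pd

# Stand-in for the frame returned by the CSV reader: "results" holds strings,
# not lists. The values are copied from the question's sample rows.
raw = pd.DataFrame({
    "id": [9680, 9690],
    "results_number": [2, 3],
    "results": [
        "[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')]",
        "[(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]",
    ],
    "creation_time": ["11/10/14 0:23", "12/10/14 0:10"],
})

# literal_eval parses Python literals only and never executes code.
# Python 3 still accepts the u'' prefix, so these strings parse as they are.
raw["results"] = raw["results"].apply(ast.literal_eval)
```

After this step each cell of raw["results"] is a list of (new_id, name) tuples, which is the shape the list comprehension above expects. A truncated string would raise an error here instead of parsing.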
Edit
If some of the data is malformed, try the following:
good_data = []
bad_data = []
for _, row in df.iterrows():
    for n, tup in enumerate(row.results):
        if len(tup) == 2:
            good_data.append([row.id, row.results_number, tup[0], tup[1], row.creation_time])
        else:
            # list.append takes a single argument, so store position and item as one tuple
            bad_data.append((n, tup))
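To see the good/bad split in action, here is a small self-contained sketch. The (47,) entry is a made-up malformed item added purely for illustration, and the name clean is hypothetical; the source id is stored alongside each bad item so it can be traced back.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [9680, 9690],
    "results_number": [2, 3],
    "results": [
        [(9394, "lesbyfaye"), (999, "Kayts & Koilsby")],
        [(5968, "Jacobsonl"), (47,), (8825, "joy")],  # (47,) is a made-up bad entry
    ],
    "creation_time": ["11/10/14 0:23", "12/10/14 0:10"],
})

good_data, bad_data = [], []
for _, row in df.iterrows():
    for n, tup in enumerate(row["results"]):
        if isinstance(tup, tuple) and len(tup) == 2:
            good_data.append([row["id"], row["results_number"],
                              tup[0], tup[1], row["creation_time"]])
        else:
            # Keep the source id, the position in the list, and the bad item itself.
            bad_data.append((row["id"], n, tup))

# Build the flattened frame from the well-formed entries only.
clean = pd.DataFrame(
    good_data,
    columns=["id", "results_numbers", "new_id", "name", "creation_time"],
)
```

Here clean holds the four well-formed pairs, and bad_data holds a single record pointing at position 1 of id 9690.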
You can also try doing this without a loop.
Original DF:
In [184]: df
Out[184]:
creation_time id results \
0 11/10/14 0:23 9680 [(9394, lesbyfaye), (999, Kayts & Koilsby)]
1 12/10/14 0:10 9690 [(5968, Jacobsonl), (47, SarHix), (8825, joy)]
results_number
0 2
1 3
Solution:
In [189]: tmp = (pd.DataFrame.from_dict(df.results.to_dict(), orient='index')
.....: .stack()
.....: .reset_index(level=1, drop=True)
.....: )
In [190]: idx = tmp.index
In [191]: new = (pd.DataFrame(tmp.tolist(), columns=['new_id','name'], index=idx)
.....: .join(df.drop(['results'], axis=1))
.....: )
Result:
In [192]: new
Out[192]:
new_id name creation_time id results_number
0 9394 lesbyfaye 11/10/14 0:23 9680 2
0 999 Kayts & Koilsby 11/10/14 0:23 9680 2
1 5968 Jacobsonl 12/10/14 0:10 9690 3
1 47 SarHix 12/10/14 0:10 9690 3
1 8825 joy 12/10/14 0:10 9690 3
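As an alternative to the stack-based approach, newer pandas versions (0.25 and later) provide DataFrame.explode, which does the one-row-per-list-item step directly. This is a different technique from the one shown above, sketched here on the same sample data; the names exploded, pairs, and flat are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [9680, 9690],
    "results_number": [2, 3],
    "results": [
        [(9394, "lesbyfaye"), (999, "Kayts & Koilsby")],
        [(5968, "Jacobsonl"), (47, "SarHix"), (8825, "joy")],
    ],
    "creation_time": ["11/10/14 0:23", "12/10/14 0:10"],
})

# One row per list item; the other columns are repeated for each item.
exploded = df.explode("results").reset_index(drop=True)

# Split each (new_id, name) tuple into two columns aligned on the new index.
pairs = pd.DataFrame(
    exploded["results"].tolist(),
    columns=["new_id", "name"],
    index=exploded.index,
)
flat = exploded.drop(columns="results").join(pairs)
```

Resetting the index before the join keeps it unique, so the two frames line up row by row instead of matching on the repeated 0 and 1 labels.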
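Alexander, thanks. This works for the dataset in the question. However, when I apply it to the whole dataset, I get the following: IndexError: string index out of range
OK, if the data is well-formed, your first solution works well. But I found that "results" is truncated to 512 characters. So, because of the truncation, I may end up with something like this at the end: "results": [(47, u'SarHix'), (8825, u'joy'), …, (6582, u'tevez'), (135, u'tr')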
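If the strings really are cut off at 512 characters, a literal parser will reject them. One possible workaround is to pull out only the complete (number, name) pairs with a regular expression and drop the cut-off tail. This is a rough sketch, not a robust parser: salvage_pairs and the sample string are hypothetical, and it assumes names do not contain a quote followed by a closing parenthesis.

```python
import re

# Matches a complete pair such as (47, u'SarHix') or (47, "SarHix").
# Group 2 captures the quote character so the same one must close the name.
PAIR_RE = re.compile(r"""\((\d+),\s*u?(['"])(.*?)\2\)""")


def salvage_pairs(text):
    """Return every complete (new_id, name) pair found in text."""
    return [(int(m.group(1)), m.group(3)) for m in PAIR_RE.finditer(text)]


# Hypothetical value whose last tuple was cut off mid-name.
truncated = "[(47, u'SarHix'), (8825, u'joy'), (6582, u'tevez'), (135, u'tr"
pairs = salvage_pairs(truncated)
```

The incomplete last entry is silently discarded, so it may be worth counting how many rows hit the 512-character limit before relying on the salvaged result.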