Reading a CSV file where one column looks like a list, creating new rows


I have a CSV file in the following format:

id  results_numbers results                                                  creation_time
9680    2           [(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')]        11/10/14 0:23
9690    3           [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]  12/10/14 0:10

I want to read this into pandas and convert it to the following:

id     results_numbers  new_id name              creation_time
9680    2               9394   lesbyfaye         11/10/14 0:23
9680    2                999   Kayts & Koilsby   11/10/14 0:23
9690    3               5968   Jacobsonl         12/10/14 0:10
9690    3                 47   SarHix            12/10/14 0:10
9690    3               8825   joy               12/10/14 0:10

Assuming you can read the data into a DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [9680, 9690], 'results_number': [2, 3], 'results': [[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')], [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]], 'creation_time': ["11/10/14 0:23", "12/10/14 0:10"]})

>>> pd.DataFrame([[row.id, row.results_number, tup[0], tup[1], row.creation_time]
...               for _, row in df.iterrows()
...               for tup in row.results],
...              columns=['id', 'results_numbers', 'new_id', 'name', 'creation_time'])

     id  results_numbers  new_id             name  creation_time
0  9680                2    9394        lesbyfaye  11/10/14 0:23
1  9680                2     999  Kayts & Koilsby  11/10/14 0:23
2  9690                3    5968        Jacobsonl  12/10/14 0:10
3  9690                3      47           SarHix  12/10/14 0:10
4  9690                3    8825              joy  12/10/14 0:10
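
The code above assumes that `results` already holds Python lists of tuples. When the column comes straight from a CSV file it is normally a plain string, and iterating over a string yields single characters, which is one way to end up with an `IndexError`. A minimal sketch of parsing the column while reading, assuming a tab-separated file (the separator and the inline `csv_text` sample are assumptions, not taken from the question):

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the real file; the tab separator is an assumption.
csv_text = (
    "id\tresults_numbers\tresults\tcreation_time\n"
    "9680\t2\t[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')]\t11/10/14 0:23\n"
    "9690\t3\t[(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]\t12/10/14 0:10\n"
)

# ast.literal_eval safely turns the text "[(9394, u'lesbyfaye'), ...]"
# into a real list of tuples; it raises on malformed or truncated input.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep="\t",
    converters={"results": ast.literal_eval},
)
```

After this, each cell of `df['results']` is a list of `(new_id, name)` tuples, so the list comprehension above can be applied directly.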
Edit

If the data is not well-formed, try the following:

good_data = []
bad_data = []
for _, row in df.iterrows():
    for n, tup in enumerate(row.results):
        if len(tup) == 2:
            good_data.append([row.id, row.results_number, tup[0], tup[1], row.creation_time])
        else:
            bad_data.append((n, tup))  # keep the position and the offending item
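
As a rough end-to-end illustration of this good/bad split, the sketch below runs the same idea on a small hypothetical sample in which one entry is a malformed 1-tuple, then builds the final DataFrame from the rows that passed the check. The sample data and the choice to store the row `id` alongside each bad item are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the second row contains a malformed 1-tuple, (47,).
df = pd.DataFrame({
    'id': [9680, 9690],
    'results_number': [2, 3],
    'results': [[(9394, 'lesbyfaye'), (999, 'Kayts & Koilsby')],
                [(5968, 'Jacobsonl'), (47,), (8825, 'joy')]],
    'creation_time': ['11/10/14 0:23', '12/10/14 0:10'],
})

good_data = []
bad_data = []
for _, row in df.iterrows():
    for n, tup in enumerate(row['results']):
        if isinstance(tup, tuple) and len(tup) == 2:
            good_data.append([row['id'], row['results_number'],
                              tup[0], tup[1], row['creation_time']])
        else:
            # Remember which row and position the bad item came from.
            bad_data.append((row['id'], n, tup))

clean = pd.DataFrame(
    good_data,
    columns=['id', 'results_numbers', 'new_id', 'name', 'creation_time'],
)
```

Here `clean` holds the four well-formed entries, and `bad_data` can be inspected afterwards to see what was skipped and why.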

You can also try doing this without a loop:

Original DF:

In [184]: df
Out[184]:
   creation_time    id                                         results  \
0  11/10/14 0:23  9680     [(9394, lesbyfaye), (999, Kayts & Koilsby)]
1  12/10/14 0:10  9690  [(5968, Jacobsonl), (47, SarHix), (8825, joy)]

   results_number
0               2
1               3

Solution:

In [189]: tmp = (pd.DataFrame.from_dict(df.results.to_dict(), orient='index')
   .....:          .stack()
   .....:          .reset_index(level=1, drop=True)
   .....:       )

In [190]: idx = tmp.index

In [191]: new = (pd.DataFrame(tmp.tolist(), columns=['new_id','name'], index=idx)
   .....:          .join(df.drop(['results'], axis=1))
   .....:       )

Result:

In [192]: new
Out[192]:
   new_id             name  creation_time    id  results_number
0    9394        lesbyfaye  11/10/14 0:23  9680               2
0     999  Kayts & Koilsby  11/10/14 0:23  9680               2
1    5968        Jacobsonl  12/10/14 0:10  9690               3
1      47           SarHix  12/10/14 0:10  9690               3
1    8825              joy  12/10/14 0:10  9690               3
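
On pandas 0.25 or later, `DataFrame.explode` offers another loop-free route. This is not part of the answer above, just a sketch of the same reshaping under the assumption that `results` already contains lists of 2-tuples:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [9680, 9690],
    'results_number': [2, 3],
    'results': [[(9394, 'lesbyfaye'), (999, 'Kayts & Koilsby')],
                [(5968, 'Jacobsonl'), (47, 'SarHix'), (8825, 'joy')]],
    'creation_time': ['11/10/14 0:23', '12/10/14 0:10'],
})

# One row per list element; each cell of 'results' is now a single tuple.
exploded = df.explode('results').reset_index(drop=True)

# Split the tuples into two columns, aligned on the exploded index.
pairs = pd.DataFrame(exploded['results'].tolist(),
                     columns=['new_id', 'name'],
                     index=exploded.index)

result = pd.concat([exploded.drop(columns='results'), pairs], axis=1)
```

The `reset_index(drop=True)` call gives each exploded row a unique index, which keeps the later `concat` unambiguous.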

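Alexander, thanks. This works for the dataset in the question. However, when I apply it to my whole dataset, I get the following: IndexError: string index out of range

OK, your first solution works well if the data is well-formed. But I found that "results" is truncated at 512 characters. So, because of the truncation, I may end up with a "results" value like this at the end: [(47,u'SarHix'),(8825,u'joy'),…,(6582,u'tevez'),(135,u'tr')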
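
For truncated strings like the one described in the last comment, `ast.literal_eval` would fail, so one possible workaround (not from the thread) is to pull out only the complete pairs with a regular expression and discard the cut-off tail. The helper name `parse_complete_pairs`, the pattern, and the sample string are all hypothetical, and the pattern assumes names contain no single quotes:

```python
import re

# Matches one complete "(<int>, u'<name>')" pair; the u prefix is optional.
# Assumption: names never contain a single quote.
PAIR_RE = re.compile(r"\((\d+),\s*u?'([^']*)'\)")


def parse_complete_pairs(text):
    """Return every fully formed (new_id, name) pair found in text."""
    return [(int(num), name) for num, name in PAIR_RE.findall(text)]


# Hypothetical value cut off in the middle of the last name.
truncated = "[(47, u'SarHix'), (8825, u'joy'), (6582, u'tevez'), (135, u'tr"
pairs = parse_complete_pairs(truncated)
```

With this input, the incomplete final entry is dropped and the three complete pairs are kept. Note that a name cut off exactly before its closing quote and parenthesis cannot be told apart from a genuinely short name, so fixing the truncation at the source is the more reliable option.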