Python: building a DataFrame from a CSV that contains dictionary columns
I have a CSV in which several columns are each filled with a dict. There are thousands of rows. I want to pull these dicts out, use their keys as columns, fill the cells with their values, and put NaN wherever a value is missing, so that:
id attributes
0 255RSSSTCHL-QLTDGLZD-BLK {"color": "Black", "hardware": "Goldtone"}
1 C3ACCRDNFLP-QLTDS-S-BLK {"size": "Small", "color": "Black"}
becomes:
id size color hardware
0 255RSSSTCHL-QLTDGLZD-BLK NaN Black Goldtone
1 C3ACCRDNFLP-QLTDS-S-BLK Small Black NaN
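There are several columns like "id" that I want to keep unchanged in the resulting DataFrame, and several columns like "attributes" that are filled with dicts, which I want to expand into columns. I have trimmed them down to the example above for illustration.
Source DF: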
In [172]: df
Out[172]:
id attributes attr2
0 255RSSSTCHL-QLTDGLZD-BLK {"color":"Black","hardware":"Goldtone"} {"aaa":"aaa", "bbb":"bbb"}
1 C3ACCRDNFLP-QLTDS-S-BLK {"size":"Small","color":"Black"} {"ccc":"ccc"}
Solution 1: parse each string with ast.literal_eval and expand the resulting dict into columns:
import ast
import pandas as pd

attr_cols = ['attributes', 'attr2']

def f(df, attr_col):
    # pop() removes the dict column from df in place; each parsed dict
    # becomes one row of new columns, which are joined back on the index
    return df.join(df.pop(attr_col)
                     .apply(lambda x: pd.Series(ast.literal_eval(x))))

for col in attr_cols:
    df = f(df, col)
In [175]: df
Out[175]:
id color hardware size aaa bbb ccc
0 255RSSSTCHL-QLTDGLZD-BLK Black Goldtone NaN aaa bbb NaN
1 C3ACCRDNFLP-QLTDS-S-BLK Black NaN Small NaN NaN ccc
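For anyone who wants to try this without an existing df, here is a minimal self-contained sketch on the two sample rows. Note that it is a variation rather than the exact code above: instead of apply(pd.Series), it parses the column into a list of dicts and passes that list to the pd.DataFrame constructor in one call, which avoids building one Series per row. The helper name expand_dict_column and its parser argument are made up for this illustration.

```python
import ast

import pandas as pd

# Sample data copied from the question; the dict columns are plain strings.
df = pd.DataFrame({
    "id": ["255RSSSTCHL-QLTDGLZD-BLK", "C3ACCRDNFLP-QLTDS-S-BLK"],
    "attributes": ['{"color":"Black","hardware":"Goldtone"}',
                   '{"size":"Small","color":"Black"}'],
    "attr2": ['{"aaa":"aaa", "bbb":"bbb"}', '{"ccc":"ccc"}'],
})


def expand_dict_column(frame, col, parser=ast.literal_eval):
    """Hypothetical helper: replace one dict-string column with one column per key."""
    parsed = [parser(s) for s in frame[col]]            # list of dicts
    expanded = pd.DataFrame(parsed, index=frame.index)  # missing keys become NaN
    return frame.drop(columns=col).join(expanded)


result = df
for col in ["attributes", "attr2"]:
    result = expand_dict_column(result, col)

print(result)
```

Unlike f above, this helper does not modify the frame passed in, so the original df stays available for comparison.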
Solution 2: the same approach with json.loads in place of ast.literal_eval (thanks for the hint in the comments). Starting again from the source DF:
import json
import pandas as pd

attr_cols = ['attributes', 'attr2']

def f(df, attr_col):
    # identical to Solution 1 except for the parser
    return df.join(df.pop(attr_col)
                     .apply(lambda x: pd.Series(json.loads(x))))

for col in attr_cols:
    df = f(df, col)
Result:
In [175]: df
Out[175]:
id color hardware size aaa bbb ccc
0 255RSSSTCHL-QLTDGLZD-BLK Black Goldtone NaN aaa bbb NaN
1 C3ACCRDNFLP-QLTDS-S-BLK Black NaN Small NaN NaN ccc
Timing, for a 20,000-row DF:
In [198]: df = pd.concat([df] * 10**4, ignore_index=True)
In [199]: df.shape
Out[199]: (20000, 3)
In [201]: %paste
def f_ast(df, attr_col):
    return df.join(df.pop(attr_col) \
                     .apply(lambda x: pd.Series(ast.literal_eval(x))))
def f_json(df, attr_col):
    return df.join(df.pop(attr_col) \
                     .apply(lambda x: pd.Series(json.loads(x))))
## -- End pasted text --
In [202]: %%timeit
     ...: for col in attr_cols:
     ...:     f_ast(df.copy(), col)
     ...:
1 loop, best of 3: 33.1 s per loop
In [203]: %%timeit
     ...: for col in attr_cols:
     ...:     f_json(df.copy(), col)
     ...:
1 loop, best of 3: 30 s per loop
In [204]: df.shape
Out[204]: (20000, 3)
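The timings above include the cost of building a pd.Series for every row, so they do not show how the two parsers compare on their own. A rough sketch for measuring just the parsing step with the standard timeit module follows; the sample string and the repetition count are arbitrary choices, and the numbers will vary by machine, so none are quoted here. Keep in mind that the two parsers accept different syntaxes: json.loads expects JSON literals such as true, false and null, while ast.literal_eval expects the Python literals True, False and None.

```python
import ast
import json
import timeit

# One of the attribute strings from the example; valid both as JSON
# and as a Python dict literal.
sample = '{"color":"Black","hardware":"Goldtone"}'

# Both parsers should agree on this input.
parsed_ast = ast.literal_eval(sample)
parsed_json = json.loads(sample)

# Time the parsers in isolation (2000 repetitions is an arbitrary choice).
t_ast = timeit.timeit(lambda: ast.literal_eval(sample), number=2000)
t_json = timeit.timeit(lambda: json.loads(sample), number=2000)

print(f"ast.literal_eval: {t_ast:.4f} s for 2000 calls")
print(f"json.loads:       {t_json:.4f} s for 2000 calls")
```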
You can use the converters option to fold the string parsing into the pd.read_csv call:
import pandas as pd
from io import StringIO
from cytoolz.dicttoolz import merge as dmerge
from json import loads
txt = """id|attributes|attr2
255RSSSTCHL-QLTDGLZD-BLK|{"color":"Black","hardware":"Goldtone"}|{"aaa":"aaa", "bbb":"bbb"}
C3ACCRDNFLP-QLTDS-S-BLK|{"size":"Small","color":"Black"}|{"ccc":"ccc"}"""
converters = dict(attributes=loads, attr2=loads)
df = pd.read_csv(StringIO(txt), sep='|', index_col='id', converters=converters)
df
We can then merge the dictionaries in each row and convert the result into a pd.DataFrame. I'll use cytoolz.dicttoolz.merge, imported above as dmerge:
pd.DataFrame(df.apply(dmerge, 1).values.tolist(), df.index).reset_index()
id aaa bbb ccc color hardware size
0 255RSSSTCHL-QLTDGLZD-BLK aaa bbb NaN Black Goldtone NaN
1 C3ACCRDNFLP-QLTDS-S-BLK NaN NaN ccc Black NaN Small
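If cytoolz is not installed, the merge step can be sketched with plain dict.update instead. The following is an illustrative variant, not the code above: merge_row is a hypothetical helper, and quoting=csv.QUOTE_NONE is added here as a precaution so that the double quotes inside the JSON text are passed to json.loads untouched.

```python
import csv
import json
from io import StringIO

import pandas as pd

txt = """id|attributes|attr2
255RSSSTCHL-QLTDGLZD-BLK|{"color":"Black","hardware":"Goldtone"}|{"aaa":"aaa", "bbb":"bbb"}
C3ACCRDNFLP-QLTDS-S-BLK|{"size":"Small","color":"Black"}|{"ccc":"ccc"}"""

# Parse the dict columns while reading, as in the answer above.
converters = {"attributes": json.loads, "attr2": json.loads}
parsed = pd.read_csv(StringIO(txt), sep="|", index_col="id",
                     converters=converters, quoting=csv.QUOTE_NONE)


def merge_row(dicts):
    """Hypothetical helper: merge an iterable of dicts, later keys winning."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


# One merged dict per row, then a single DataFrame construction.
records = [merge_row(row) for row in parsed.itertuples(index=False)]
flat = pd.DataFrame(records, index=parsed.index).reset_index()

print(flat)
```

As with dmerge, a key that appears in more than one dict column of the same row keeps only the value from the last column, so this assumes the key sets do not collide.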
Comments:
If the dictionaries are also valid JSON objects, then json.loads is about 5% faster than ast.literal_eval.
@DYZ, I've added a timing. For that DF it was 10% faster ;)