Python: building a DataFrame from a CSV with dictionary columns


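I have a CSV with several columns that are populated with dicts. There are thousands of rows. I want to pull those dicts out, make columns from their keys, fill the cells with their values, and put NaN where a value is missing. So that: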

   id                            attributes
0   255RSSSTCHL-QLTDGLZD-BLK     {"color": "Black", "hardware": "Goldtone"}
1   C3ACCRDNFLP-QLTDS-S-BLK      {"size": "Small", "color": "Black"}
becomes:

   id                            size   color   hardware  
0   255RSSSTCHL-QLTDGLZD-BLK     NaN    Black   Goldtone
1   C3ACCRDNFLP-QLTDS-S-BLK      Small  Black   NaN

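There are several columns like "id" that I want to leave untouched in the resulting DataFrame, and several columns like "attributes" that are filled with dicts and that I want to expand into columns. I truncated them to the example above for illustration.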

Source DF:

In [172]: df
Out[172]:
                         id                               attributes                       attr2
0  255RSSSTCHL-QLTDGLZD-BLK  {"color":"Black","hardware":"Goldtone"}  {"aaa":"aaa", "bbb":"bbb"}
1   C3ACCRDNFLP-QLTDS-S-BLK         {"size":"Small","color":"Black"}               {"ccc":"ccc"}
Solution 1:

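import ast
import pandas as pd

attr_cols = ['attributes', 'attr2']

def f(df, attr_col):
    # pop() removes the dict column from df in place; each string is parsed
    # into a dict and expanded into a Series, so its keys become columns
    parsed = df.pop(attr_col).apply(lambda x: pd.Series(ast.literal_eval(x)))
    return df.join(parsed)

for col in attr_cols:
    df = f(df, col)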
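Solution 2 (with thanks for the hint): the same function with json.loads in place of ast.literal_eval (see f_json in the timing section below).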
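
A minimal sketch of the json.loads variant, assuming the dict strings are valid JSON (double-quoted keys and values). It mirrors f_json from the timing section; the small sample frame is rebuilt from the question's data only so that the block runs on its own.

```python
import json
import pandas as pd

# Sample frame rebuilt from the question's data so the sketch is self-contained
df = pd.DataFrame({
    'id': ['255RSSSTCHL-QLTDGLZD-BLK', 'C3ACCRDNFLP-QLTDS-S-BLK'],
    'attributes': ['{"color":"Black","hardware":"Goldtone"}',
                   '{"size":"Small","color":"Black"}'],
    'attr2': ['{"aaa":"aaa", "bbb":"bbb"}', '{"ccc":"ccc"}'],
})

attr_cols = ['attributes', 'attr2']

def f_json(df, attr_col):
    # pop() removes the string column in place; json.loads parses each string
    # and pd.Series turns the resulting dict's keys into columns
    parsed = df.pop(attr_col).apply(lambda x: pd.Series(json.loads(x)))
    return df.join(parsed)

for col in attr_cols:
    df = f_json(df, col)
```

Note that json.loads rejects single-quoted Python dict literals, so ast.literal_eval remains the fallback when the column was written with repr() rather than as JSON.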

Result:

In [175]: df
Out[175]:
                         id  color  hardware   size  aaa  bbb  ccc
0  255RSSSTCHL-QLTDGLZD-BLK  Black  Goldtone    NaN  aaa  bbb  NaN
1   C3ACCRDNFLP-QLTDS-S-BLK  Black       NaN  Small  NaN  NaN  ccc
Timing for a 20,000-row DF:

In [198]: df = pd.concat([df] * 10**4, ignore_index=True)

In [199]: df.shape
Out[199]: (20000, 3)

In [201]: %paste
def f_ast(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(ast.literal_eval(x))))

def f_json(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(json.loads(x))))
## -- End pasted text --

In [202]: %%timeit
     ...: for col in attr_cols:
     ...:     f_ast(df.copy(), col)
     ...:
1 loop, best of 3: 33.1 s per loop

In [203]:

In [203]: %%timeit
     ...: for col in attr_cols:
     ...:     f_json(df.copy(), col)
     ...:
1 loop, best of 3: 30 s per loop

In [204]: df.shape
Out[204]: (20000, 3)
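
Both functions timed above build one pd.Series per row, which is probably where much of the time goes. The sketch below is an alternative that is not part of the original answer and was not timed here: it parses every string once and builds a single DataFrame from the list of dicts. The helper name expand_json_col is hypothetical, and the sample frame is rebuilt from the question's data so the block runs on its own.

```python
import json
import pandas as pd

def expand_json_col(df, attr_col):
    # Hypothetical helper (not from the original answer): parse all strings,
    # then build ONE DataFrame from the list of dicts instead of creating a
    # pd.Series per row; keys missing from a row become NaN
    dicts = df[attr_col].map(json.loads).tolist()
    expanded = pd.DataFrame(dicts, index=df.index)
    return df.drop(columns=attr_col).join(expanded)

# Sample frame rebuilt from the question's data
df = pd.DataFrame({
    'id': ['255RSSSTCHL-QLTDGLZD-BLK', 'C3ACCRDNFLP-QLTDS-S-BLK'],
    'attributes': ['{"color":"Black","hardware":"Goldtone"}',
                   '{"size":"Small","color":"Black"}'],
    'attr2': ['{"aaa":"aaa", "bbb":"bbb"}', '{"ccc":"ccc"}'],
})

for col in ['attributes', 'attr2']:
    df = expand_json_col(df, col)
```

Unlike f_ast and f_json, this helper does not mutate its input, so no df.copy() is needed when calling it repeatedly on the same frame.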

You can embed the string parsing in the pd.read_csv call by using the converters option.

import pandas as pd
from io import StringIO
from cytoolz.dicttoolz import merge as dmerge
from json import loads

txt = """id|attributes|attr2
255RSSSTCHL-QLTDGLZD-BLK|{"color":"Black","hardware":"Goldtone"}|{"aaa":"aaa", "bbb":"bbb"}
C3ACCRDNFLP-QLTDS-S-BLK|{"size":"Small","color":"Black"}|{"ccc":"ccc"}"""

converters = dict(attributes=loads, attr2=loads)

df = pd.read_csv(StringIO(txt), sep='|', index_col='id', converters=converters)
df

Then we can merge each row's dictionaries and convert the result into a pd.DataFrame. I'll use cytoolz.dicttoolz.merge, imported above as dmerge.

pd.DataFrame(df.apply(dmerge, 1).values.tolist(), df.index).reset_index()

                         id  aaa  bbb  ccc  color  hardware   size
0  255RSSSTCHL-QLTDGLZD-BLK  aaa  bbb  NaN  Black  Goldtone    NaN
1   C3ACCRDNFLP-QLTDS-S-BLK  NaN  NaN  ccc  Black       NaN  Small
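
If cytoolz is not available, the per-row merge can be sketched with a plain dict comprehension. This is an assumption-laden stand-in, not the answer's method: the frame below is built directly from already-parsed dicts to imitate what the read_csv call with converters produces, and on a key clash the later column's value wins.

```python
import pandas as pd

# Stand-in for the frame returned by read_csv(..., converters=...):
# every cell of 'attributes' and 'attr2' is already a dict
df = pd.DataFrame(
    {
        'attributes': [{'color': 'Black', 'hardware': 'Goldtone'},
                       {'size': 'Small', 'color': 'Black'}],
        'attr2': [{'aaa': 'aaa', 'bbb': 'bbb'}, {'ccc': 'ccc'}],
    },
    index=pd.Index(['255RSSSTCHL-QLTDGLZD-BLK', 'C3ACCRDNFLP-QLTDS-S-BLK'],
                   name='id'),
)

# Merge the dicts of each row into one dict (later columns win on key clashes)
merged = [
    {key: value for d in row for key, value in d.items()}
    for row in df.itertuples(index=False)
]

# One DataFrame from the list of merged dicts; missing keys become NaN
result = pd.DataFrame(merged, index=df.index).reset_index()
```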

If the dictionaries are also valid JSON objects, then json.loads is about 5% faster than ast.literal_eval.
@DYZ I've added a timing: for that DF it was about 10% faster ;)