Python: converting (transposing) a DataFrame column of lists into columns

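I have a pandas DataFrame in which each cell of one column holds a list of values. I need to turn those values into columns containing true or false, depending on whether the value appears in that row's list. There should be one column for each unique value found across all of the rows' lists.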

Here is my DataFrame:

import pandas as pd

data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
]
df = pd.DataFrame(data)

   agency_id  province
0          1  [CH, PE]
1          3  [CH, CS]
That creates the initial DataFrame.

Then I tried:

df2 = pd.DataFrame(df['province'].values.tolist(),index=df['agency_id'])
Its output (on my full data, which has more rows than the sample above) is:

              0     1     2     3     4     5     6     7
agency_id                                                
1            CH    PE    AQ    TE  None  None  None  None
3            KR    CS  None  None  None  None  None  None
7            FE    FC    BO    MO    RA    RE    RN    PR
8          None  None  None  None  None  None  None  None
10           RM  None  None  None  None  None  None  None
11           RM  None  None  None  None  None  None  None
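But this is not what I want, because the columns are not "aligned": a given value does not always end up in the same column.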

I need something like this:

agency_id CH PE CS
1 true true false
3 true false true

Use MultiLabelBinarizer from sklearn:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df['province']),columns=mlb.classes_, index=df.agency_id).astype(bool)
Out[90]: 
             CH     CS     PE
agency_id                    
1          True  False   True
3          True   True  False

If you do not want to import MultiLabelBinarizer from sklearn.preprocessing, you can clean up/modify the data yourself to get the same result:

import pandas as pd

data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]

# get all provinces from any included dictionaries of data:
all_prov = sorted(set( (x for y in [d["province"] for d in data] for x in y) ))

# add the missing key:values to your data's dicts:
for d in data:
    for p in all_prov:
        d[p] = p in d["province"]

print(data)

df = pd.DataFrame(data)
print(df)
Output:

# data
[{'agency_id': 1, 'province': ['CH', 'PE'], 'CH': True, 'CS': False, 'PE': True}, 
 {'agency_id': 3, 'province': ['CH', 'CS'], 'CH': True, 'CS': True, 'PE': False}]

# df 
     CH     CS     PE  agency_id  province
0  True  False   True          1  [CH, PE]
1  True   True  False          3  [CH, CS] 
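
A variant of the same idea, as a minimal sketch: it builds new row dicts instead of adding keys to the original data, leaves out the province list, and uses agency_id as the index so the result matches the layout asked for in the question. The names rows and flags are made up for illustration. Note also that recent pandas versions keep the dict key order for the columns, so the column order printed above may differ on your machine.

```python
import pandas as pd

data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
]

# Collect every distinct province that occurs in any row, in sorted order.
all_prov = sorted({p for d in data for p in d["province"]})

# Build one new dict per row: the id plus a True/False flag per province.
# The original dicts in `data` are left untouched.
rows = [
    {"agency_id": d["agency_id"], **{p: p in d["province"] for p in all_prov}}
    for d in data
]

flags = pd.DataFrame(rows).set_index("agency_id")
print(flags)
```

Because the flags are computed per row from all_prov, a row whose list is empty simply gets False in every column.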

Another solution, using only pandas:

import pandas as pd

data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
df = pd.DataFrame(data)

result = df['province'].apply(lambda x: '|'.join(x)).str.get_dummies().astype(bool).set_index(df.agency_id)
print(result)
Output:

             CH     CS     PE
agency_id                    
1          True  False   True
3          True   True  False
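
Joining with '|' assumes that no value contains that character. A possible alternative, sketched here under the assumption that pandas 0.25 or newer is available (for Series.explode), avoids the string round trip. The extra row with agency_id 8 and an empty list is invented for illustration, to show that such rows are kept with False everywhere.

```python
import pandas as pd

data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
    {"agency_id": 8, "province": []},  # hypothetical row with no provinces
]
df = pd.DataFrame(data)

# One row per (agency_id, province) pair; an empty list becomes a single NaN.
exploded = df.set_index("agency_id")["province"].explode()

# One indicator column per distinct province (NaN gets no column),
# then collapse back to one row per agency_id.
result = (
    pd.get_dummies(exploded, dtype=int)
    .groupby(level=0)
    .max()
    .astype(bool)
)
print(result)
```

Taking the maximum per agency_id means a province listed twice in the same row still yields a single True.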

Accepted, because this is the most complete answer so far. Thank you very much.