Python: How to one-hot encode a DataFrame where each row contains a list


python pandas


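I am trying to feed data into a machine learning algorithm, where one DataFrame column contains lists:

For example, a patient may take several drugs and have several reactions to those drugs, and they also have a name. If they take more than one drug, the entry is a list of two or more drugs. They have only one name.

I believe one-hot encoding is the right approach.

Here is what I have done so far:

I have a DataFrame: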

df = pandas.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, {'drug': ['drugC','drugD'], 'patient': 'angel'}])

             drug patient
0  [drugA, drugB]    john
1  [drugC, drugD]   angel

I would like to get something like this:

   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel

I tried this:

pandas.get_dummies(df.apply(pandas.Series).stack()).sum(level=0)

But I get:

TypeError: unhashable type: 'list'
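
The error most likely arises because `df.apply(pandas.Series)` leaves the `drug` column as lists, so `get_dummies` ends up trying to hash list objects. Below is a minimal sketch of a corrected version of the same idea, assuming a recent pandas release (where `sum(level=0)` is no longer available and `groupby(level=0).sum()` is used instead). The names `expanded`, `dummies`, and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([{'drug': ['drugA', 'drugB'], 'patient': 'john'},
                   {'drug': ['drugC', 'drugD'], 'patient': 'angel'}])

# Expand only the list column; each list element lands in its own column.
expanded = df['drug'].apply(pd.Series)

# stack() yields one drug per (row, position). get_dummies encodes each drug,
# and groupby(level=0).sum() adds the indicators back up per original row.
dummies = pd.get_dummies(expanded.stack(), dtype=int).groupby(level=0).sum()

# Re-attach the scalar column.
result = dummies.join(df['patient'])
print(result)
```

Note that `sum()` counts occurrences, so a drug listed twice for one patient would give 2; `max()` could be used instead if a strict 0/1 indicator is needed.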

Building on a related answer, here is one approach:

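import pandas as pd

df = pd.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'},
                   {'drug': ['drugC','drugD'], 'patient': 'angel'}])
# Spread each list across columns, then unstack into one long Series of drug names
s = (df.drug
       .apply(lambda x: pd.Series(x))
       .unstack())
# Join one drug per row back onto the patient data
df2 = (df.join(pd.DataFrame(s.reset_index(level=0, drop=True)))
         .drop('drug', axis=1)
         .rename(columns={0: 'drug'}))
# One-hot encode the drug column and merge on the (duplicated) index
(df2.merge(pd.get_dummies(df2.drug), left_index=True, right_index=True)
    .drop('drug', axis=1))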

Output:

  patient  drugA  drugB  drugC  drugD
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0
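
The output above has one row per patient and drug pair, repeated by the merge on a duplicated index, rather than one row per patient as the question asks. Below is a minimal sketch of one way to collapse it, assuming pandas 0.25 or later for `DataFrame.explode` (which the answer above does not use). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'drug': ['drugA', 'drugB'], 'patient': 'john'},
                   {'drug': ['drugC', 'drugD'], 'patient': 'angel'}])

# One row per (patient, drug); the original row index is repeated.
exploded = df.explode('drug')

# Encode each drug, then take the max per original row index so that
# every patient ends up with a single 0/1 row.
encoded = pd.get_dummies(exploded['drug'], dtype=int).groupby(level=0).max()

# Put the patient name back next to the indicators.
result = encoded.join(df['patient'])
print(result)
```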

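Use:

pop to extract the column (or keep the column and use drop instead),
create a new DataFrame via values and tolist,
get_dummies,
groupby + max,
concat with the original.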

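# Variant with pop: removes 'drug' from df in place
df1 = (pd.get_dummies(pd.DataFrame(df.pop('drug').values.tolist()),
                      prefix='', prefix_sep='', dtype=int)
         # merge duplicate column names; groupby(axis=1) is deprecated in recent pandas
         .T.groupby(level=0).max().T)

df1 = pd.concat([df1, df], axis=1)
print(df1)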
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel
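
The groupby + max step matters when the same drug appears at different list positions for different patients, because `get_dummies` then produces duplicate column names. The sketch below illustrates this with hypothetical data (a variation on the question's frame in which `drugB` is second for john and first for angel), and uses a transpose-based grouping that is assumed to be equivalent to `groupby(axis=1, level=0)` in older pandas.

```python
import pandas as pd

# Hypothetical data: drugB sits at position 1 for john and position 0 for angel.
df = pd.DataFrame([{'drug': ['drugA', 'drugB'], 'patient': 'john'},
                   {'drug': ['drugB', 'drugC'], 'patient': 'angel'}])

# One column per list position, then one indicator column per (position, drug).
raw = pd.get_dummies(pd.DataFrame(df['drug'].values.tolist()),
                     prefix='', prefix_sep='', dtype=int)
print(raw.columns.tolist())  # drugB appears twice

# Collapse columns that share a name, keeping 1 if any position had the drug.
merged = raw.T.groupby(level=0).max().T
print(merged)
```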

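# Variant with drop: leaves the original df unchanged
df1 = (pd.get_dummies(pd.DataFrame(df['drug'].values.tolist()),
                      prefix='', prefix_sep='', dtype=int)
         # merge duplicate column names; groupby(axis=1) is deprecated in recent pandas
         .T.groupby(level=0).max().T)

df1 = pd.concat([df1, df.drop('drug', axis=1)], axis=1)
print(df1)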
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel

# Alternative with str.get_dummies (pop variant): turn each list into "drugA,drugB"
df1 = (df.pop('drug').astype(str)
         .replace([r'\[', r'\]', "'", r'\s+'], '', regex=True)
         .str.get_dummies(','))
df1 = pd.concat([df1, df], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel

# Alternative with str.get_dummies (drop variant): leaves the original df unchanged
df1 = (df['drug'].astype(str)
         .replace([r'\[', r'\]', "'", r'\s+'], '', regex=True)
         .str.get_dummies(','))
df1 = pd.concat([df1, df.drop('drug', axis=1)], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel
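
The `str.get_dummies` variants above convert each list to its string representation and then strip brackets and quotes with regular expressions. A possibly simpler sketch, not taken from the answer, joins each list with a separator first. It assumes every list element is a string and that no drug name contains the `|` character.

```python
import pandas as pd

df = pd.DataFrame([{'drug': ['drugA', 'drugB'], 'patient': 'john'},
                   {'drug': ['drugC', 'drugD'], 'patient': 'angel'}])

# Series.str.join concatenates the strings inside each list: 'drugA|drugB'.
joined = df['drug'].str.join('|')

# str.get_dummies splits on the separator and returns 0/1 indicator columns.
dummies = joined.str.get_dummies('|')

df1 = pd.concat([dummies, df.drop('drug', axis=1)], axis=1)
print(df1)
```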