Python: separating a column of lists into individual columns

python numpy encoding scikit-learn


The incoming data is a list of zero or more categories per row:

import numpy as np
import pandas as pd
# input data frame
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})

  categories
0  [A, B, C]
1     [B, C]
2        [A]

I want to convert it into a data frame with one column per category and a 0/1 in each cell:

#desired output

   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0

Attempts with OneHotEncoder and LabelEncoder got stuck because they do not handle lists inside cells. Currently, the desired result is obtained with nested for loops:

# get unique categories ['A', 'B', 'C']
categories = np.unique(np.concatenate(df['categories'].tolist()))

# make empty data frame
binary_df = pd.DataFrame(columns=list(categories), index=df.index)

print(binary_df)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN

# fill data frame
for i in binary_df.index:
    for c in categories:
        # .loc[i, c] writes into binary_df itself; chained .loc[i][c] may write to a copy
        binary_df.loc[i, c] = 1 if c in df.loc[i, 'categories'] else 0

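My concern is that the loops make this an extremely inefficient way to handle large data sets (dozens of categories, tens of thousands of rows or more).

Is there a way to get this result using built-in NumPy/scikit-learn functions?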
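For reference, scikit-learn ships a transformer aimed at cells that hold several labels at once. The following is a minimal sketch, not taken from the answers below, assuming `sklearn.preprocessing.MultiLabelBinarizer` is available in the installed version; the names `mlb` and `encoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'categories': (list('ABC'), list('BC'), list('A'))})

mlb = MultiLabelBinarizer()
# fit_transform takes an iterable of label collections and returns a 0/1 array
encoded = mlb.fit_transform(df['categories'])

# classes_ holds the sorted labels, which become the column names
binary_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
print(binary_df)
```

Rows whose list is empty should come out as all zeros here, which fits the "zero or more categories" requirement.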

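You can try building the rows from a map (a dictionary): each category defaults to 0 in the map and is updated to 1 if it appears in the corresponding row of the input data frame.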

#input data frame
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
print(df)

Output:
   categories
0  [A, B, C]
1     [B, C]
2        [A]


For the output data frame:

categories = np.unique(np.concatenate(df['categories'].tolist()))

# build one dictionary per input row; DataFrame.append was removed in
# pandas 2.0, so the rows are collected first and the frame is built once
rows = []
for index, row in df.iterrows():
    row_elements = row['categories']
    default_row = {item: 0 for item in categories}
    # update corresponding row value by updating dictionary
    for i in row_elements:
        default_row[i] = 1
    rows.append(default_row)

binary_df = pd.DataFrame(rows, columns=list(categories), index=df.index)
print(binary_df)

Output:

   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0

How the solution works: start from the Series of lists, convert it to a data frame, and stack it

pd.DataFrame(df['categories'].tolist()).stack()
Out[101]: 
0  0    A
   1    B
   2    C
1  0    B
   1    C
2  0    A
dtype: object

Pass that to get_dummies, which keeps the index for later use

pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack())
Out[102]: 
     A  B  C
0 0  1  0  0
  1  0  1  0
  2  0  0  1
1 0  0  1  0
  1  0  0  1
2 0  1  0  0

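Almost there, but the inner index level is junk: it only records each value's position within its original list.

So the solution above sums over the outer level of this MultiIndex, which collapses the result back to one row per input row.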
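The steps above can be put together as a short sketch. The exact one-liner from the answer is not reproduced in this text, so the final aggregation below, `groupby(level=0).sum()`, is an assumption about how to sum over the outer index level in current pandas; the variable name `stacked` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'categories': (list('ABC'), list('BC'), list('A'))})

# one entry per (original row, position in list) pair
stacked = pd.DataFrame(df['categories'].tolist()).stack()

# one-hot encode each entry, then sum per original row (outer index level)
binary_df = pd.get_dummies(stacked).groupby(level=0).sum().astype(int)
print(binary_df)
```

Two caveats with this sketch: the intermediate data frame uses a fresh 0..n-1 index rather than `df.index`, and a row whose list is empty may drop out of the result, so `binary_df.reindex(range(len(df)), fill_value=0)` may be needed if empty lists occur.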

Edit: %timeit results

On the original data frame

df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})

The provided solution:

100 loops, best of 3: 3.24 ms per loop

This solution:

100 loops, best of 3: 2.29 ms per loop

With 300 rows

df = pd.concat(100*[df]).reset_index(drop=True)

The provided solution:

1 loop, best of 3: 252 ms per loop

This solution:

100 loops, best of 3: 2.45 ms per loop
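To repeat a comparison like this outside IPython, the standard-library `timeit` module can stand in for `%timeit`. This is only a sketch: the helper name `dummies_solution` is hypothetical, the aggregation step assumes `groupby(level=0).sum()`, and the numbers it prints will differ from those quoted above depending on hardware and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'categories': (list('ABC'), list('BC'), list('A'))})
big_df = pd.concat(100 * [df]).reset_index(drop=True)  # 300 rows


def dummies_solution(frame):
    """get_dummies on the stacked lists, summed back to one row per input row."""
    stacked = pd.DataFrame(frame['categories'].tolist()).stack()
    return pd.get_dummies(stacked).groupby(level=0).sum()


# best of 3 repeats, 10 calls each, reported per call in milliseconds
best = min(timeit.repeat(lambda: dummies_solution(big_df), repeat=3, number=10)) / 10
print(f"{best * 1e3:.2f} ms per loop")
```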

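Excellent breakdown. I would never have guessed from the name alone that get_dummies was the function for this purpose. Thanks!

@user1717828 Yes, I believe it is short for "get dummy variables", which might be more intuitive.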