Python: create a column for each category in a DataFrame

python pandas

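I have a DataFrame with many columns, in which binary columns indicate which categories are present in each observation. Every observation has exactly 3 categories set to 1; the rest are 0. I want to create 3 new columns, one per category, whose value is the name of the category (that is, the name of the binary column) wherever the value equals 1. To make this clearer:

I have:

x|y|z|k|w
---------
0|1|1|0|1

which would become:

cat1|cat2|cat3
--------------
y   |z   |w

How can I do this?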

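Here is one way:

import pandas as pd

df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})

# For each row, collect the names of the columns whose value is truthy (1)
split = df.apply(lambda row: [col for col in df.columns if row[col]], axis=1).values.tolist()
df2 = pd.DataFrame(split)

# Output as posted, with columns in alphabetical order (k, w, x, y, z).
# On Python 3.7+ the dict's insertion order is kept, so names follow x, y, z, k, w.
# Row 1 of this sample has four 1s, hence the fourth column.
#    0  1  2     3
# 0  w  y  z  None
# 1  k  w  x     y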
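
The question asks for columns named cat1 to cat3, whereas the result above is labelled 0, 1, 2 and so on. A minimal sketch of one way to get those labels, assuming every row has exactly three 1s (the sample frame is adjusted here so that this holds, and the names `names` and `cats` are illustrative rather than taken from the answer):

```python
import pandas as pd

# Sample data with exactly three 1s per row (an assumption of this sketch).
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})

# For each row, keep the column names whose flag is set.
names = [[col for col, flag in zip(df.columns, row) if flag] for row in df.to_numpy()]

# Build the result and label the columns cat1..catN (1-based, as in the question).
cats = pd.DataFrame(names, index=df.index)
cats.columns = [f'cat{i}' for i in range(1, cats.shape[1] + 1)]
print(cats)
```

Iterating over `df.to_numpy()` avoids building a Series per row, which is what `apply(..., axis=1)` does internally.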

You can do this (it handles the first row only):

In [13]: pd.DataFrame([df.columns[df.astype(bool).values[0]]]).add_prefix('cat')
Out[13]:
  cat0 cat1 cat2
0    y    z    w
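
The one-liner above indexes the column labels with the boolean mask of the first row only. As a hedged sketch, the same idea can be applied to every row by looping over the mask; the variable names `mask` and `all_rows` are illustrative, and the sample frame assumes three 1s per row:

```python
import pandas as pd

# Two-row sample with exactly three 1s per row (an assumption of this sketch).
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})

# Boolean mask of the whole frame: one array of flags per observation.
mask = df.astype(bool).to_numpy()

# Index the column labels with each row's mask instead of only values[0].
all_rows = pd.DataFrame([list(df.columns[m]) for m in mask], index=df.index).add_prefix('cat')
print(all_rows)
```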

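For better performance, use a numpy solution:

import numpy as np

print(df)
   x  y  z  k  w
0  0  1  1  0  1
1  1  1  0  0  1

c = df.columns.values
# keep the result in a new name so the original df stays available for the steps that follow
df1 = pd.DataFrame(c[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat')
print(df1)
  cat0 cat1 cat2
0    y    z    w
1    x    y    w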
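
The `reshape(-1, 3)` step relies on every row having exactly three 1s. If one row had two and another four, the reshape would still succeed and silently move a name into the wrong row. A minimal sketch of a guarded version follows; the helper name `names_of_ones` and its `n_cats` parameter are hypothetical, not part of the original answer:

```python
import numpy as np
import pandas as pd


def names_of_ones(frame, n_cats=3):
    """Return the names of the columns holding 1, n_cats per row."""
    flags = frame.to_numpy().astype(bool)
    counts = flags.sum(axis=1)
    # Refuse input that would make the reshape mix names across rows.
    if not (counts == n_cats).all():
        bad_rows = frame.index[counts != n_cats].tolist()
        raise ValueError(f'rows without exactly {n_cats} ones: {bad_rows}')
    col_idx = np.where(flags)[1].reshape(-1, n_cats)
    names = frame.columns.to_numpy()[col_idx]
    return pd.DataFrame(names, index=frame.index).add_prefix('cat')


good = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})
print(names_of_ones(good))

bad = good.copy()
bad.loc[1, 'k'] = 1  # row 1 now has four 1s
try:
    names_of_ones(bad)
except ValueError as err:
    print(err)
```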

Details:

# get the indices of the 1s (row positions, column positions)
print(np.where(df))
(array([0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 4, 0, 1, 4], dtype=int64))

# select the second array (column positions)
print(np.where(df)[1])
[1 2 4 0 1 4]

# reshape to 3 columns
print(np.where(df)[1].reshape(-1, 3))
[[1 2 4]
 [0 1 4]]

# index the column names with those positions
print(c[np.where(df)[1].reshape(-1, 3)])
[['y' 'z' 'w']
 ['x' 'y' 'w']]
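
The steps above use only the second array returned by `np.where`. When rows may hold different numbers of 1s, the first array (row positions) can be used as well. The following is a sketch under that assumption, with illustrative names (`long`, `wide`, `pos`); it uses positional row numbers, so it presumes a default integer index:

```python
import numpy as np
import pandas as pd

# Sample from the first answer: row 0 has three 1s, row 1 has four.
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})

# Row and column positions of every 1, in row-major order.
rows, cols = np.where(df.to_numpy())

# Long format: one line per 1, with its row position and column name.
long = pd.DataFrame({'row': rows, 'name': df.columns.to_numpy()[cols]})

# A running counter within each row gives the target column position.
long['pos'] = long.groupby('row').cumcount()

# Back to wide format; rows with fewer 1s are padded with NaN.
wide = long.pivot(index='row', columns='pos', values='name').add_prefix('cat')
print(wide)
```

This is slower than the plain reshape, so it is only worth using when the three-per-row guarantee does not hold.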

Timings:

# repeat the 2-row sample 1000 times (2000 rows)
df = pd.concat([df] * 1000, ignore_index=True)

# jezrael's solution
In [390]: %timeit (pd.DataFrame(df.columns.values[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat'))
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 503 µs per loop

# jpp's solution
In [391]: %timeit (pd.DataFrame(df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()))
10 loops, best of 3: 111 ms per loop

# Zero's solution works only with a one-row DataFrame, so it is not included
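
The `%timeit` lines above need IPython. As a rough sketch, the comparison can be repeated in a plain script with the standard `timeit` module; the function names `numpy_version` and `apply_version` are illustrative, and the absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

# Two-row sample with three 1s per row, repeated to 2000 rows as in the timings above.
base = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 0], 'w': [1, 1]})
big = pd.concat([base] * 1000, ignore_index=True)


def numpy_version(frame):
    # Column positions of the 1s, three per row, mapped to column names.
    idx = np.where(frame.to_numpy())[1].reshape(-1, 3)
    return pd.DataFrame(frame.columns.to_numpy()[idx]).add_prefix('cat')


def apply_version(frame):
    # Row-wise Python loop over the columns.
    rows = frame.apply(lambda row: [c for c in frame.columns if row[c]], axis=1)
    return pd.DataFrame(rows.values.tolist()).add_prefix('cat')


# Both approaches should agree before their speed is compared.
assert numpy_version(big).to_numpy().tolist() == apply_version(big).to_numpy().tolist()

for fn in (numpy_version, apply_version):
    # Best of 3 repeats, 3 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(big), number=3, repeat=3)) / 3
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```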

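Possible duplicate.

Very similar, but that one seems to work only when there is a single category. At least, that is the case in that example.

@NTiberio - is performance important? Please check the timings in my answer.

@NTiberio, if one of the solutions below helped, feel free to accept it (tick on the left) so that other users know.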