Python: k-hot encoding of values in multiple columns

python pandas numpy

I have a pandas.DataFrame:

|   | col_1 | col_2 | col_3 | col_4 |
|:--|:------|:------|:------|:------|
| 0 |   1   |   2   |  NaN  |  NaN  |
| 1 |   3   |   4   |   5   |   6   |
| 2 |   2   |   6   |  NaN  |  NaN  |

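I need to turn the values (1, 2, 3, 4, 5, 6) into columns, setting a row's entry to 1 if that value appears anywhere in the row and 0 otherwise: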

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|:--|:--|:--|:--|:--|:--|:--|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 |

pd.get_dummies does not work here. As far as I can see, pd.get_dummies cannot one-hot encode across all the values in the DataFrame's columns.

How can I achieve this?
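To make the problem concrete, here is a minimal sketch of what a plain `pd.get_dummies` call produces on this data. It assumes the frame is built with `dtype=float` (the NaN cells force float columns anyway); the name `dummies` is only illustrative.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question (float dtype because of the NaN cells).
df = pd.DataFrame({'col_1': [1, 3, 2],
                   'col_2': [2, 4, 6],
                   'col_3': [np.nan, 5, np.nan],
                   'col_4': [np.nan, 6, np.nan]}, dtype=float)

# Encode every column. Each indicator column is prefixed with the name of
# the column it came from, so the same value is encoded once per source column.
dummies = pd.get_dummies(df, columns=list(df.columns), dtype=int)

print(list(dummies.columns))
print(dummies.shape)
```

The value 2 ends up in two separate columns (one derived from `col_1`, one from `col_2`) instead of a single column `2`, which is why the per-column encoding still has to be merged by value.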
One approach:

With memory efficiency in mind, here is another:

    idx = np.searchsorted(constant_set, a)
    out = np.zeros((len(df), len(constant_set)), dtype=int)
    flattend_idx = idx + out.shape[1] * np.arange(len(idx))[:, None]
    out.flat[flattend_idx[idx < len(constant_set)]] = 1
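The snippet above uses `a` and `constant_set` without defining them. Here is a self-contained sketch of the same `np.searchsorted` idea, assuming `a` is the DataFrame's value array and `constant_set` is the sorted array of values to encode. The exact-match check and the names `in_range`, `safe_idx`, `valid` and `result` are additions for illustration and are not part of the original snippet. The check makes a value that is missing from `constant_set` get skipped instead of being mapped to the next larger value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [1, 3, 2],
                   'col_2': [2, 4, 6],
                   'col_3': [np.nan, 5, np.nan],
                   'col_4': [np.nan, 6, np.nan]}, dtype=float)

# Assumptions: `a` holds the raw cell values, `constant_set` is sorted.
a = df.to_numpy()
constant_set = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# Insertion position of every cell in constant_set.
# NaN sorts after all numbers, so it maps to len(constant_set).
idx = np.searchsorted(constant_set, a)

# Keep only cells whose value is actually present in constant_set.
in_range = idx < len(constant_set)
safe_idx = np.where(in_range, idx, 0)           # avoid out-of-bounds lookups
valid = in_range & (constant_set[safe_idx] == a)

# Set out[row, position] = 1 for every valid cell.
out = np.zeros((len(df), len(constant_set)), dtype=int)
rows = np.nonzero(valid)[0]
out[rows, idx[valid]] = 1

result = pd.DataFrame(out, index=df.index, columns=constant_set.astype(int))
print(result)
```

Only one small integer array per cell is allocated here, rather than a full comparison of every cell against every candidate value, which is where the memory saving comes from.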

You can also use the get_dummies function, as follows:

    import numpy as np
    import pandas as pd

    # The definition of your dataframe
    df = pd.DataFrame({'col_1': [1, 3, 2],
                       'col_2': [2, 4, 6],
                       'col_3': [np.nan, 5, np.nan],
                       'col_4': [np.nan, 6, np.nan]}, dtype=float)

    # Get dummies where you leave out the prefix
    # This will ensure that all columns of the same value will get the same column name
    df = pd.get_dummies(df, columns=['col_1', 'col_2', 'col_3', 'col_4'], prefix='', dtype=int)

    # Group the columns by name and merge columns of the same name into one column
    # (transposed first, because grouping with axis=1 is deprecated in recent pandas)
    result = df.T.groupby(level=0).max().T

So what we are doing here is applying get_dummies to all columns, which gives the following:

       _1.0  _2.0  _3.0  _2.0  _4.0  _6.0  _5.0  _6.0
    0     1     0     0     1     0     0     0     0
    1     0     0     1     0     1     0     1     1
    2     0     1     0     0     0     1     0     0

Then we merge all columns that have the same name to get the desired result:

       _1.0  _2.0  _3.0  _4.0  _5.0  _6.0
    0     1     1     0     0     0     0
    1     0     0     1     1     1     1
    2     0     1     0     0     0     1

Another approach is to use pd.melt():

    # Set it up.
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'col_1': [1, 3, 2],
                       'col_2': [2, 4, 6],
                       'col_3': [np.nan, 5, np.nan],
                       'col_4': [np.nan, 6, np.nan]}, dtype=float)

    (pd.get_dummies(                          # Pandas' one-hot function
        df.T.melt()                           # Flip DataFrame, then switch from wide to long format.
          .set_index('variable')['value'])    # 'variable' is the row name (id) in your orig DataFrame.
       .groupby('variable')
       .sum())                                # Coalesce same ids and add rows together.

It doesn't work that way. As far as I can see, pd.get_dummies cannot one-hot encode across all the values in the DataFrame's columns.

What if the 6 in col_4 were replaced by, say, 9?

@Divakar The values in the columns come from a constant set, so that can't happen.