Python: Perform one-hot encoding with pandas

python pandas

I am building the following DataFrame with pandas:

import numpy as np
import pandas as pd

df=pd.DataFrame(np.array([20,"admin","France",
                             25,"worker","Italy",
                             45,"admin","Norway",
                             30,"sec","EEUU",
                             25,"law",np.nan,
                             30,"sec","France"]
            ).reshape(6,3))
df.columns=["age","job","country"]

I want to perform one-hot encoding, but with OneHotEncoder rather than the get_dummies function. So I wrote the following code:

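from sklearn import preprocessing

def oneHotEncoding(df):
    # The encoder returns a sparse matrix by default; dtype=int replaces the removed np.int alias.
    ohe=preprocessing.OneHotEncoder(dtype=int,handle_unknown="ignore")
    values=pd.DataFrame(ohe.fit_transform(df[["country"]]).toarray())
    df=pd.concat([df,values],axis=1)
    df=df.drop(columns=["country"])
    print(df)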

The problem is that the result I get looks like this:

   age  job    0  1  2  3  4
0   20  admin  0  1  0  0  0
1   25  worker 0  0  1  0  0
2   45  admin  0  0  0  1  0
3   30  sec    1  0  0  0  0
4   25  law    0  0  0  0  1
5   30  sec    0  1  0  0  0

I would like the resulting columns to carry names such as France, Italy, and so on. I tried the following code:

values=pd.DataFrame(ohe.fit_transform(df[["country"]]).toarray(),columns=["country_"+str(int(i)) for i in range(df.shape[1])])

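But it does not give me the correct result.

Also, the NaN value is still treated as a country, when that row should contain only 0s.

How can I solve these problems? I have tried the various approaches I found here, but none of them helped.

Thanks.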

In pandas, we have get_dummies:

pd.get_dummies(df,columns=['country'])
Out[429]: 
  age     job     ...       country_Norway  country_nan
0  20   admin     ...                    0            0
1  25  worker     ...                    0            0
2  45   admin     ...                    1            0
3  30     sec     ...                    0            0
4  25     law     ...                    0            1
5  30     sec     ...                    0            0
[6 rows x 7 columns]
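The `country_nan` column in the output above appears because `np.array` coerces the mixed input to strings, so the missing value is stored as the text "nan". The following is a minimal sketch, assuming the frame can be built from a dict instead (that construction is not part of the question), in which the missing country stays a real NaN and `get_dummies` leaves that row as all zeros:

```python
import numpy as np
import pandas as pd

# Assumption: building from a dict keeps a separate dtype per column,
# so the missing country remains a real NaN rather than the string "nan".
df = pd.DataFrame({
    "age": [20, 25, 45, 30, 25, 30],
    "job": ["admin", "worker", "admin", "sec", "law", "sec"],
    "country": ["France", "Italy", "Norway", "EEUU", np.nan, "France"],
})

# dummy_na defaults to False, so a real NaN does not get its own column.
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
```

If a separate indicator column for missing values is wanted after all, `dummy_na=True` can be passed to `get_dummies`.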

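If you insist on using OneHotEncoder, your problem is that the sparse matrix carries no column labels; in your example, that information is actually stored on ohe.

After calling fit_transform, you can access the categories through the categories_ attribute of the OneHotEncoder.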
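One way the `categories_` attribute might be used to label the encoded columns is sketched below. This is an illustration under stated assumptions, not the answerer's exact code: the helper name `one_hot_encode` and the sentinel string `"__missing__"` are hypothetical, and the frame is built from a dict so that the missing country is a real NaN. The encoder is fitted on the non-missing rows only, so the sentinel is an unknown category and `handle_unknown="ignore"` encodes it as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumption: the frame is built from a dict so the missing country is a real NaN.
df = pd.DataFrame({
    "age": [20, 25, 45, 30, 25, 30],
    "job": ["admin", "worker", "admin", "sec", "law", "sec"],
    "country": ["France", "Italy", "Norway", "EEUU", np.nan, "France"],
})

def one_hot_encode(df, column):
    """Hypothetical helper: one-hot encode `column`, naming columns from categories_."""
    ohe = OneHotEncoder(dtype=int, handle_unknown="ignore")
    # Fit only on rows where the value is present, so NaN is never a category.
    ohe.fit(df[[column]].dropna())
    # Replace NaN with a sentinel the encoder has not seen; it is encoded as all zeros.
    filled = df[[column]].fillna("__missing__")
    encoded = ohe.transform(filled).toarray()
    # categories_ holds one array of category labels per encoded input column.
    names = ["{}_{}".format(column, cat) for cat in ohe.categories_[0]]
    values = pd.DataFrame(encoded, columns=names, index=df.index)
    return pd.concat([df.drop(columns=[column]), values], axis=1)

result = one_hot_encode(df, "country")
print(result)
```

Passing `index=df.index` when wrapping the encoded array keeps the rows aligned in `pd.concat` even when the original frame does not use a default integer index.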

Thanks @WeNYoBen, but for practice I would like to know how to do it with OneHotEncoder.

Thanks for your answer @user3483203. I have a silly question: when you retrieve the sparse matrix, how do you know the attribute is A? I mean, in data.A. I have looked through several manuals and could not find it. Is it a Python convention, like A = Array? Thanks.

It is just shorthand for converting a sparse matrix to an array; to get a one-dimensional array, the shorthand is A1.
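The shorthand discussed in the comments above can be compared with the explicit conversion methods. The sketch below assumes SciPy is installed and uses a small made-up matrix. Because availability of the `.A` attribute on sparse objects may depend on the SciPy version, it calls `toarray()` and `todense()` explicitly, and reads `A1` from the `numpy.matrix` returned by `todense()`.

```python
import numpy as np
from scipy import sparse

# Illustrative data only; not taken from the question.
m = sparse.csr_matrix(np.array([[0, 1, 0],
                                [1, 0, 0]]))

as_array = m.toarray()    # explicit conversion to a 2-D numpy.ndarray
as_matrix = m.todense()   # conversion to a 2-D numpy.matrix
flat = as_matrix.A1       # numpy.matrix.A1: flattened 1-D ndarray

print(as_array.shape, flat.shape)
```

In new code, `toarray()` is usually the clearer choice, since it states the conversion explicitly.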