Statistics for one-hot encoded columns in a Python DataFrame


python pandas dataframe


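I have a pandas DataFrame with a column titled "label". It also has three columns titled featureA_1, featureA_2, and featureA_3. These columns hold the one-hot encoding of featureA, which can take three unique values. Similarly, it has two columns named featureB_1 and featureB_2, which hold the one-hot encoding of featureB, which can take two distinct values.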

An example of such a DataFrame can be generated as follows:

import pandas as pd
dictt = {
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
}

df1 = pd.DataFrame(dictt)

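Because of the one-hot encoding, in every row of this DataFrame exactly one of featureA_1, featureA_2, and featureA_3 is 1 and the others are 0. Similarly, in every row exactly one of featureB_1 and featureB_2 is 1 and the other is 0.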
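The one-hot property described above can be checked directly. The following is a minimal sketch that rebuilds the example frame and verifies that each group of one-hot columns sums to 1 in every row. Selecting columns by prefix with filter(like=...) is an illustrative choice here, not something taken from the original post.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})

# For each one-hot group, every row should contain exactly one 1,
# so the row-wise sum over the group's columns must equal 1.
row_sums = {}
for prefix in ("featureA_", "featureB_"):
    row_sums[prefix] = df1.filter(like=prefix).sum(axis=1)
    assert (row_sums[prefix] == 1).all(), f"{prefix} is not a valid one-hot group"
```

Since the columns contain only 0 and 1, this also means that the per-label mean of each column is the fraction of rows in that label with the corresponding feature value.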

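I want to create a DataFrame that gives, for each label, the percentage of entries with each of the feature values featureA_1, featureA_2, and featureA_3, as well as the percentage of entries with each of the feature values featureB_1 and featureB_2.
I also want the standard deviation of the percentages across the featureA value types and across the featureB value types.
Here is an example of the DataFrame I would like to have: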

What is the most efficient way to do this? In my actual work, the DataFrames will have millions of rows.
Use:

# aggregate the mean to get the fraction of 1s, because the columns hold only 0 and 1
df = df1.groupby('label').mean().add_suffix('_perc').round(2)
# aggregate std with ddof=0, because the pandas default is ddof=1
df2 = df.groupby(lambda x: x.split('_')[0], axis=1).std(ddof=0).add_suffix('_std').round(2)
# join together
df = pd.concat([df, df2], axis=1).sort_index(axis=1).reset_index()
print(df)

  label  featureA_1_perc  featureA_2_perc  featureA_3_perc  featureA_std  \
0   cat             0.60              0.2             0.20          0.19
1   dog             0.67              0.0             0.33          0.27

   featureB_1_perc  featureB_2_perc  featureB_std
0             0.40             0.60          0.10
1             0.67             0.33          0.17
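The axis argument of DataFrame.groupby is deprecated since pandas 2.1, so the column-wise grouping step may need a different spelling on newer versions. The following is a minimal sketch of one possible alternative, not the answer's own method: it transposes the percentage table, groups the rows by column prefix, and transposes back. The names perc, std, result, and the prefix helper are illustrative. As in the answer, the standard deviation is computed from the already rounded percentages.

```python
import pandas as pd

df1 = pd.DataFrame({
    "label": ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "featureA_1": [1, 0, 1, 1, 0, 1, 1, 0],
    "featureA_2": [0, 1, 0, 0, 0, 0, 0, 0],
    "featureA_3": [0, 0, 0, 0, 1, 0, 0, 1],
    "featureB_1": [0, 0, 1, 1, 0, 0, 1, 1],
    "featureB_2": [1, 1, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1.
perc = df1.groupby("label").mean().add_suffix("_perc").round(2)


def prefix(col):
    # Hypothetical helper: "featureA_1_perc" -> "featureA"
    return col.split("_")[0]


# Group columns by prefix without axis=1:
# transpose, group the index by prefix, then transpose back.
# ddof=0 gives the population standard deviation.
std = perc.T.groupby(prefix).std(ddof=0).T.add_suffix("_std").round(2)

# Combine percentages and standard deviations into one table.
result = pd.concat([perc, std], axis=1).sort_index(axis=1).reset_index()
print(result)
```

The transposed table has only as many rows as there are one-hot columns, so this step stays cheap even when df1 has millions of rows. The cost is dominated by the initial groupby on label.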