Python: Normalizing a DataFrame column based on the values in another column

python pandas dataframe

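I want to normalize the values in one column of a DataFrame based on the values in another column. This is not normalization in the strict statistical sense. The second value is a type; I want to sum all the first values for each type and then, in each row, divide the value by the total for that row's type. An example should make this clearer: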

import pandas as pd

df = pd.read_table(datafile, names=["A", "B", "value", "type"])

    A   B  value   type
0  A1  B1      1  type1
1  A2  B2      1  type1
2  A1  B1      1  type2
3  A1  B3      1  type3
4  A2  B2      1  type2
5  A2  B4      1  type3
6  A3  B4      1  type2
7  A3  B5      1  type3
8  A4  B6      1  type2
9  A4  B7      1  type3
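The contents of `datafile` are not shown, so for anyone who wants to reproduce the table, here is a minimal sketch that builds the same ten rows in memory. Constructing the frame from a dict is an assumption made for illustration only; the question itself reads the data with `pd.read_table`.

```python
import pandas as pd

# Same ten rows as the table above, built in memory
# instead of being read from `datafile` (which is not shown).
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A4"],
    "B": ["B1", "B2", "B1", "B3", "B2", "B4", "B4", "B5", "B6", "B7"],
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})

print(df)
```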

I can then compute the per-type sums like this:

types = df.groupby(["type"])["value"].sum()

type
type1    2
type2    4
type3    4
Name: value, dtype: int64

But how do I use this to normalize the value in each row?

I could compute the normalized values with a loop like this:

norms = []
for ix, row in df.iterrows():
    norms.append(row["value"]/types[row["type"]])

and then replace the column with a new one holding those values:

df["value"] = pd.Series(norms)

    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

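But as far as I understand, a loop like this is neither efficient nor idiomatic, and there is probably a way to do this with standard functions.

Thanks.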

I think the best way to do this is to use the .apply() method on the groupby object:

# Using backslashes for explicit line continuation, not seen
#   that often in Python but useful in pandas when you're
#   chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
    .apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]: 
    A   B  value   type  value_normed
0  A1  B1      1  type1          0.50
1  A2  B2      1  type1          0.50
2  A1  B1      1  type2          0.25
3  A1  B3      1  type3          0.25
4  A2  B2      1  type2          0.25
5  A2  B4      1  type3          0.25
6  A3  B4      1  type2          0.25
7  A3  B5      1  type3          0.25
8  A4  B6      1  type2          0.25
9  A4  B7      1  type3          0.25

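You need the group_keys=False argument so that type does not become an index level on each group's data, which would prevent you from easily matching the transformed values back to the original DataFrame.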
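To see what the argument changes, here is a small sketch comparing the two settings. The six-row frame, its values, and the helper name `share` are made up for illustration, and the comments describe the behavior I would expect on pandas 2.x; older releases handled `group_keys` differently for functions that return a like-indexed result, so treat this as a sketch rather than a guarantee.

```python
import pandas as pd

# Hypothetical sample data, not the frame from the question.
df = pd.DataFrame({
    "value": [1, 3, 2, 2, 6, 2],
    "type": ["type1", "type1", "type2", "type3", "type2", "type3"],
})

def share(s):
    """Divide each value by the total of its group."""
    return s / s.sum()

# group_keys=True: on pandas 2.x the group label is added as an
# outer index level, so the result no longer lines up with df.index.
keyed = df.groupby("type", group_keys=True)["value"].apply(share)

# group_keys=False: the result keeps the original row index,
# so it can be assigned straight back as a new column.
unkeyed = df.groupby("type", group_keys=False)["value"].apply(share)
df["value_normed"] = unkeyed

print(keyed.index)
print(df)
```

Selecting the `value` column before calling `.apply()` is a small variation on the snippet above; it keeps the grouping column out of the function's input.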

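You can use transform, which performs an operation on each group and then expands the result back up to match the original index. For example, the per-type sums aligned to each row are: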

>>> df.groupby("type")["value"].transform("sum")
0    2
1    2
2    4
3    4
4    4
5    4
6    4
7    4
8    4
9    4
dtype: int64
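Because the transformed sums share the original index, they can be divided into the `value` column directly. The sketch below rebuilds the sample frame in memory (an assumption, since `datafile` is not shown) and writes the result to a new column named `value_normed`, as in the `.apply()` answer. It also shows an equivalent route through `Series.map` that reuses the `types` sums from the question.

```python
import pandas as pd

# Sample frame from the question, rebuilt in memory for illustration.
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A4"],
    "B": ["B1", "B2", "B1", "B3", "B2", "B4", "B4", "B5", "B6", "B7"],
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})

# Per-type totals, broadcast back to one entry per original row.
totals = df.groupby("type")["value"].transform("sum")

# Element-wise division aligns on the shared index.
df["value_normed"] = df["value"] / totals

# Alternative: look up each row's type in the aggregated sums.
types = df.groupby("type")["value"].sum()
via_map = df["value"] / df["type"].map(types)

print(df)
```

Passing the string `"sum"` rather than the built-in `sum` avoids a deprecation warning on recent pandas releases and uses the optimized group-wise sum.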
