Python: conditionally aggregate a grouped DataFrame using different functions depending on a column's value
Consider the following DataFrame:
import pandas as pd
df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})
I want to aggregate the data in df["val"] by grouping on group_id and then applying a different aggregation function to each group. To decide which aggregation function to use, I want a condition that references another column of df, namely condition. Specifically, I want the sum of all elements of val in the "ones" group, and the mean of all elements in the "tens" group. (I don't need to extract the aggregation function's name from condition; the condition column could hold anything, as long as every row of a group has the same condition, e.g. all "ones" rows correspond to "sum". The condition column may even be redundant.)
I would like to obtain the following result:
df_aggregated = pd.DataFrame({"group_id": ["ones", "tens"],
                              "val_aggregated": [6, 25]})
In R with dplyr there is a clean way to do this:
df <- tibble(val = c(1, 2, 3, 10, 20, 30, 40),
             group_id = c("ones", "ones", "ones", "tens", "tens", "tens", "tens"),
             condition = c("sum", "sum", "sum", "mean", "mean", "mean", "mean"))

df_aggregated <- df %>%
  group_by(group_id) %>%
  summarise(val_aggregated = case_when(condition == "sum" ~ sum(val),
                                       condition == "mean" ~ mean(val),
                                       TRUE ~ NA_real_)) %>%
  distinct()
But I can't seem to find a clean way to do this kind of aggregation in pandas. Maybe the solution involves NumPy's select() function? Or is the idiomatic approach to loop over the grouped data structure? Any help is much appreciated.

One way to achieve this is to group on both group_id and condition and aggregate:
(
    df.groupby(["group_id", "condition"])
    .agg(["sum", "mean"])
    .stack()
    .reset_index()
    # keep only the rows where condition matches the aggregate's name
    .query("condition == level_2")
    .drop(columns=["condition", "level_2"])
    .rename(columns={"val": "val_aggregated"})
)

  group_id  val_aggregated
0     ones               6
3     tens              25
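To see why the query step filters correctly, it can help to print the intermediate frame just before it. A small illustrative sketch, repeated self-contained here with the df from the question:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})

# Both aggregates are computed for every group; stacking moves the
# aggregate names ("sum"/"mean") into an index level, which
# reset_index exposes as an unnamed column that pandas calls "level_2".
intermediate = (
    df.groupby(["group_id", "condition"])
      .agg(["sum", "mean"])
      .stack()
      .reset_index()
)
print(intermediate)
```

query("condition == level_2") then keeps exactly the one row per group whose aggregate name matches that group's condition.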
Another option is to pivot the data and then aggregate:
result = df.pivot(columns=["group_id", "condition"], values="val")
result

group_id   ones  tens
condition   sum  mean
0           1.0   NaN
1           2.0   NaN
2           3.0   NaN
3           NaN  10.0
4           NaN  20.0
5           NaN  30.0
6           NaN  40.0
Pair ones and tens with sum and mean:
mapping = zip(*result.columns)
mapping = dict(zip(*mapping))
mapping
{'ones': 'sum', 'tens': 'mean'}
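As a side note, result.columns is itself an iterable of (group_id, condition) tuples, so the two-step zip above can be collapsed into a single dict() call. A minimal equivalent sketch, self-contained:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})
result = df.pivot(columns=["group_id", "condition"], values="val")

# Each column label is already a ("group_id", "condition") pair,
# so dict() pairs them up without the zip round-trip.
mapping = dict(result.columns)
print(mapping)  # {'ones': 'sum', 'tens': 'mean'}
```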
Drop the condition level from the columns and aggregate:
(
    result.droplevel(level="condition", axis="columns")
    .agg(mapping)
    .rename_axis(index="group_id")
    .reset_index(name="val_aggregated")
)

  group_id  val_aggregated
0     ones             6.0
1     tens            25.0
Another option, somewhat similar to the dplyr solution, is to use np.where, as you mentioned in the question:
import numpy as np

group = df.groupby("group_id")
(
    df.assign(
        val_aggregate=np.where(
            df.condition.eq("sum"),
            group.val.transform("sum"),
            group.val.transform("mean"),
        )
    )
    .loc[:, ["group_id", "val_aggregate"]]
    .drop_duplicates()
)

  group_id  val_aggregate
0     ones           6.0
3     tens          25.0
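If you do want the aggregation function to be read from the condition column itself, one hedged sketch (assuming every value in condition names a valid Series method, as in the example data) uses groupby.apply with getattr:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})

# Look up the Series method named in each group's first `condition`
# value and call it on that group's `val`.
df_aggregated = (
    df.groupby("group_id")
      .apply(lambda g: getattr(g["val"], g["condition"].iat[0])())
      .reset_index(name="val_aggregated")
)
print(df_aggregated)
```

This is slower than the vectorized options above but generalizes to any number of distinct conditions without listing them explicitly.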
The quickest and simplest way is to use df.groupby('group_id').val.agg(['mean','sum']) and
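The sentence above is cut off, but it seems to be heading toward computing both aggregates at once and then selecting the right one per group. A hedged sketch of that idea; the selection step is my assumption, not part of the original text:

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 10, 20, 30, 40],
                   "group_id": ["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition": ["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})

# Compute both aggregates for every group ...
both = df.groupby("group_id").val.agg(["mean", "sum"])

# ... then pick, for each group, the column named by its condition
# (assumed completion of the truncated idea).
cond = df.drop_duplicates("group_id").set_index("group_id")["condition"]
df_aggregated = pd.DataFrame({
    "group_id": cond.index,
    "val_aggregated": [both.at[g, c] for g, c in cond.items()],
})
print(df_aggregated)
```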