
Python: conditionally aggregate a grouped DataFrame with different functions depending on values in a column


Consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({"val":[1, 2, 3, 10, 20, 30, 40],
                   "group_id":["ones", "ones", "ones", "tens", "tens", "tens", "tens"],
                   "condition":["sum", "sum", "sum", "mean", "mean", "mean", "mean"]})
I would like to aggregate the data in df["val"] by grouping on group_id and then applying a different aggregation function to each group. To decide which aggregation function to use, I want a condition that refers to another column of df, namely condition.

Specifically, I want to take the sum of all elements of val in the "ones" group and the mean of all elements in the "tens" group. (I do not necessarily need to pull the name of the aggregation function out of condition; that column could hold anything, as long as every row in a group has the same condition, e.g. all "ones" rows correspond to the "sum" condition, so the condition column may even be redundant.)

I would like to end up with the following result:

df_aggregated = pd.DataFrame({"group_id":["ones", "tens"],
                              "val_aggregated":["6", "25"]})
In R with dplyr there is a clean way to do this:

df <- tibble(val = c(1, 2, 3, 10, 20, 30, 40),
             group_id = c("ones", "ones", "ones", "tens", "tens", "tens", "tens"),
             condition = c("sum", "sum", "sum", "mean", "mean", "mean", "mean"))

df_aggregated <- df %>%
  group_by(group_id) %>% 
  summarise(val_aggregated = case_when(condition == "sum" ~ sum(val),
                                       condition == "mean" ~ mean(val),
                                       TRUE ~ NA_real_)) %>% 
  distinct()
But I cannot seem to find a nice way to do this kind of aggregation in pandas. Maybe the solution involves NumPy's select() function? Or is the idiomatic approach to loop over the grouped data structure?


Any help is much appreciated.

One way to achieve this is to group on both group_id and condition and aggregate:

(
    df.groupby(["group_id", "condition"])
    .agg(["sum", "mean"])
    .stack()
    .reset_index() 
     # keep only the rows where the aggregation name matches the condition
    .query("condition==level_2")
    .drop(columns=["condition", "level_2"])
    .rename(columns={"val": "val_aggregated"})
)

    group_id    val_aggregated
0      ones         6
3      tens         25
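
As an aside, the "loop over the grouped data structure" that the question hints at can also be written with groupby().apply() and a per-group dispatch. This is just a minimal sketch (not part of the original answer); it assumes every row in a group carries the same condition value:

import pandas as pd

# map each condition label to the aggregation it should trigger
funcs = {"sum": "sum", "mean": "mean"}

def aggregate_group(g):
    # all rows in the group share one condition, so look at the first row
    return g["val"].agg(funcs[g["condition"].iloc[0]])

(
    df.groupby("group_id")
    .apply(aggregate_group)
    .rename("val_aggregated")
    .reset_index()
)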
Another option is to pivot the data and then aggregate:

result = df.pivot(columns=["group_id", "condition"], values="val")
result

group_id    ones    tens
condition   sum     mean
0           1.0     NaN
1           2.0     NaN
2           3.0     NaN
3           NaN     10.0
4           NaN     20.0
5           NaN     30.0
6           NaN     40.0
(
    result.droplevel(level="condition", axis="columns")
    .agg(mapping)
    .rename_axis(index="group_id")
    .reset_index(name="val_aggregated")
)


    group_id    val_aggregated
0       ones    6.0
1       tens    25.0
Pair ones and tens with sum and mean respectively:

mapping = zip(*result.columns)
mapping = dict(zip(*mapping))
mapping
{'ones': 'sum', 'tens': 'mean'}
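
As an aside (not in the original answer), result.columns already holds the (group_id, condition) pairs, so the same mapping can be built more directly:

mapping = dict(result.columns)   # {'ones': 'sum', 'tens': 'mean'}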
Drop the condition level from the columns and aggregate:

(
    result.droplevel(level="condition", axis="columns")
    .agg(mapping)
    .rename_axis(index="group_id")
    .reset_index(name="val_aggregated")
)


    group_id    val_aggregated
0       ones    6.0
1       tens    25.0
Another option, somewhat similar to the dplyr solution, is to use np.where, as you alluded to in the question:

import numpy as np

group = df.groupby("group_id")

(
    df.assign(
        val_aggregate=np.where(
            df.condition.eq("sum"),
            group.val.transform("sum"),
            group.val.transform("mean"),
        )
    )
    .loc[:, ["group_id", "val_aggregate"]]
    .drop_duplicates()
)

    group_id    val_aggregate
0       ones        6
3       tens        25
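
Since the question also mentions NumPy's select(), the same idea extends to more than two conditions. A minimal sketch along those lines (not from the original answer; it assumes each condition label maps directly to a transform name):

import numpy as np

group = df.groupby("group_id")

# one precomputed per-row aggregate per condition label
choices = {
    "sum": group.val.transform("sum"),
    "mean": group.val.transform("mean"),
}
# np.select picks, row by row, the first choice whose condition is True
conditions = [df.condition.eq(label) for label in choices]

(
    df.assign(val_aggregate=np.select(conditions, list(choices.values()), default=np.nan))
    .loc[:, ["group_id", "val_aggregate"]]
    .drop_duplicates()
)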

The quickest and simplest way would be to take df.groupby('group_id').val.agg(['mean', 'sum']) and ...
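
One way that combination might be completed (a sketch based on an assumption about the cut-off comment above, not the commenter's actual code: compute both aggregates per group, then keep the one named by each group's condition):

import pandas as pd

# both aggregates for every group
both = df.groupby("group_id").val.agg(["mean", "sum"])

# the single condition label used by each group
cond = df.groupby("group_id").condition.first()

# pick, per group, the aggregate named by that group's condition
df_aggregated = pd.DataFrame({
    "group_id": cond.index,
    "val_aggregated": [both.loc[g, c] for g, c in cond.items()],
})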