Pandas groupby + multiple aggregations/apply on multiple columns


I have this minimal sample data:

import numpy as np
import pandas as pd
from pandas import Timestamp

data = pd.DataFrame({'Client': {0: "Client_1",  1: "Client_2",  2: "Client_2",  3: "Client_3",  4: "Client_3",  5: "Client_3",  6: "Client_4",  7: "Client_4"},
 'Id_Card': {0: 1,  1: 2,  2: 3,  3: 4,  4: 5,  5: 6,  6: 7,  7: 8},
 'Type': {0: 'A',  1: 'B',  2: 'C',  3: np.nan,  4: 'A',  5: 'B',  6: np.nan,  7: 'B'},
 'Loc': {0: 'ADW',  1: 'ZCW',  2: 'EWC',  3: "VWQ",  4: "OKS",  5: 'EQW',  6: "PKA",  7: 'CSA'},
 'Amount': {0: 10.0,  1: 15.0,  2: 17.0,  3: 32.0,  4: np.nan,  5: 51.0,  6: 38.0,  7: -20.0},
 'Net': {0: 30.0,  1: 42.0,  2: -10.0,  3: 15.0,  4: 98,  5: np.nan,  6: 23.0,  7: -10.0},
 'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data

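I am trying to aggregate this data by grouping on the Client column: count the Id_Card values per client, concatenate Type and Loc separated by ; (for example, A;B and ZCW;EWC for Client_2, not A;ZCW and B;EWC), sum Amount and Net per client, and get the minimum Date per client. However, I'm running into some problems: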

  • These functions work perfectly on their own, but I couldn't find a way to mix the aggregate function and the apply function in one call:

    data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
    data.groupby('Client')['Loc'].apply(';'.join).reset_index()
    
  • The apply function doesn't work on columns with missing values:

    data.groupby('Client')['Type'].apply(';'.join).reset_index()
    TypeError: sequence item 0: expected str instance, float found
    
  • The aggregate and apply functions don't let me pass several columns for a single transformation:

    cols_to_sum = ["Amount", "Net"]
    data.groupby("Client").agg({"Id_Card": "count", cols_to_sum:"sum", "Date": "min"})

    cols_to_join = ["Type", "Loc"]
    data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()

    In this third case I only have Amount and Net to sum, and I could list them separately in the aggregate function, but I'm looking for a more efficient way because I'm working with a large number of columns.


    The expected output is the same dataframe, but aggregated according to the conditions listed at the beginning.

    Go step by step and prepare three separate dataframes to merge later. The first dataframe covers the simple functions such as count, sum and min:

    df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Net": "sum", "Date": "min"}).reset_index()
    
    Next, handle the joins for Type and Loc; we use fillna to deal with the NaN values:

    df2 = data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(';'.join).reset_index()
    df3 = data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(';'.join).reset_index()
    
    Finally, merge the results together:

    data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
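
    One thing to be aware of with the fillna('') step: the empty string still takes part in the join, so a group that starts with a missing value gets a leading separator. The sketch below compares that with dropping the missing rows first. The small frame is a hypothetical stand-in for the Type column of the sample data, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical subset mirroring the Type values of Client_3 and Client_4.
df = pd.DataFrame({
    "Client": ["Client_3", "Client_3", "Client_3", "Client_4", "Client_4"],
    "Type": [np.nan, "A", "B", np.nan, "B"],
})

# fillna turns NaN into an empty string, so the separator is still emitted.
filled = df.fillna({"Type": ""}).groupby("Client")["Type"].apply(";".join)

# Dropping the missing rows first leaves only real values to join.
dropped = df.dropna(subset=["Type"]).groupby("Client")["Type"].apply(";".join)

print(filled.to_dict())   # {'Client_3': ';A;B', 'Client_4': ';B'}
print(dropped.to_dict())  # {'Client_3': 'A;B', 'Client_4': 'B'}
```

    Note that dropna removes a client from the result entirely if all of its values in that column are missing, which may or may not be what you want before the merge.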
    


    To do the join, the NaN values have to be filtered out. Since the join has to be applied in two places, I created a separate function:

    def join_non_nan_values(elements):
        return ";".join([elem for elem in elements if elem == elem])  # NaN != NaN, so elem == elem is False only for NaN and those values are skipped
    
    data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                                "Loc": join_non_nan_values, "Amount":"sum", "Net": "sum", "Date": "min"})
    

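    Is it possible to concatenate the missing values in the function as well? For example, NaN;B or nan;B as the Type for Client_4.

    Yes, we can do that with
    ";".join([elem if elem == elem else str(elem) for elem in elements])
    Since NaN is a float, it has to be converted to a string.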
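
    For the "large number of columns" part of the question, one possible approach is to build the dictionary passed to agg from the column lists. This is a minimal sketch, not a definitive implementation: make_joiner and build_spec are hypothetical helper names introduced here, the frame is a reduced version of the sample data, and pd.notna is used in place of the elem == elem check.

```python
import numpy as np
import pandas as pd

def make_joiner(sep=";", keep_nan=False):
    """Return a function that joins the values of one group into a string."""
    def join_values(elements):
        if keep_nan:
            # str(np.nan) is "nan", so missing values appear as text.
            return sep.join(str(elem) for elem in elements)
        # Skip missing values entirely.
        return sep.join(str(elem) for elem in elements if pd.notna(elem))
    return join_values

# Reduced version of the sample data (Client_1, Client_2 and Client_4 rows).
data = pd.DataFrame({
    "Client": ["Client_1", "Client_2", "Client_2", "Client_4", "Client_4"],
    "Id_Card": [1, 2, 3, 7, 8],
    "Type": ["A", "B", "C", np.nan, "B"],
    "Loc": ["ADW", "ZCW", "EWC", "PKA", "CSA"],
    "Amount": [10.0, 15.0, 17.0, 38.0, -20.0],
    "Net": [30.0, 42.0, -10.0, 23.0, -10.0],
    "Date": pd.to_datetime(["2018-09-29", "1996-08-02", None, None, "2010-08-25"]),
})

cols_to_sum = ["Amount", "Net"]
cols_to_join = ["Type", "Loc"]

def build_spec(joiner):
    # One entry per column; the lists above expand into individual keys.
    spec = {"Id_Card": "count", "Date": "min"}
    spec.update({col: "sum" for col in cols_to_sum})
    spec.update({col: joiner for col in cols_to_join})
    return spec

dropped = data.groupby("Client").agg(build_spec(make_joiner()))
kept = data.groupby("Client").agg(build_spec(make_joiner(keep_nan=True)))

print(dropped.loc["Client_4", "Type"])  # B
print(kept.loc["Client_4", "Type"])     # nan;B
```

    Adding another column to sum or join then only means appending its name to cols_to_sum or cols_to_join.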