Pandas groupby + multiple aggregate/apply with multiple columns
I have this minimal sample data:
import numpy as np
import pandas as pd
from pandas import Timestamp
data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3", 4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data
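I am trying to aggregate this data grouped by the Client column: count the Id_Card values per client, concatenate Type and Loc separated by a semicolon (for example B;C and ZCW;EWC for Client_2, not B;ZCW and C;EWC), sum Amount and Net per client, and get the minimum Date per client. However, I am facing some problems: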
1. I cannot find a way to combine the aggregate function and the apply function:
data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
2. The NaN values in the Type column break the join:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
3. I cannot pass a list of columns to the aggregation or to the join:
cols_to_sum = ["Amount", "Net"]
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum: "sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
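In (3), I only put Amount and Net in the aggregate function. I could list each of them separately, but I am looking for a more efficient way because I am working with a large number of columns.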
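The expected output is the same dataframe, aggregated with the conditions listed at the beginning.

Go step by step, preparing three different dataframes so that they can be merged later. The first dataframe is for simple functions such as count, sum, mean:
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Net": "sum", "Date": "min"}).reset_index()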
Next, handle the Type and Loc joins. We use fillna to deal with the NaN values:
df2 = data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(';'.join).reset_index()
df3 = data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(';'.join).reset_index()
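Finally, merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new holds the new output, with one row per client. Because of fillna(''), a missing Type shows up as an empty string in the joined value.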
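The three steps above can be put together in one runnable sketch. The shortened frame and the names sample, counts, types, locs and merged are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's data (hypothetical subset of rows).
sample = pd.DataFrame({
    "Client": ["Client_2", "Client_2", "Client_4", "Client_4"],
    "Id_Card": [2, 3, 7, 8],
    "Type": ["B", "C", np.nan, "B"],
    "Loc": ["ZCW", "EWC", "PKA", "CSA"],
    "Amount": [15.0, 17.0, 38.0, -20.0],
})

# Step 1: plain aggregations (count, sum) in a single agg call.
counts = (
    sample.groupby("Client")
    .agg({"Id_Card": "count", "Amount": "sum"})
    .reset_index()
)

# Step 2: string joins, replacing NaN with an empty string first
# so that ";".join only ever sees str values.
types = (
    sample[["Client", "Type"]].fillna("")
    .groupby("Client")["Type"].apply(";".join).reset_index()
)
locs = (
    sample[["Client", "Loc"]].fillna("")
    .groupby("Client")["Loc"].apply(";".join).reset_index()
)

# Step 3: merge the partial results on the grouping key.
merged = counts.merge(types, on="Client").merge(locs, on="Client")
print(merged)
```

Note that with this approach a missing value leaves a stray separator behind: the Type value for Client_4 comes out as ;B rather than B.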
To do the join, the NaN values have to be filtered out. Since the join has to be applied in two places, I created a separate function:
def join_non_nan_values(elements):
    # elem == elem is False only for NaN, so NaN values are dropped
    return ";".join([elem for elem in elements if elem == elem])

data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                            "Loc": join_non_nan_values, "Amount": "sum", "Net": "sum", "Date": "min"})
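For the list-of-columns problem raised in the question, one option is to build the aggregation dictionary from the column lists instead of typing each key by hand. This is a minimal sketch under assumptions: the shortened frame sample and the name agg_spec are hypothetical, and the helper is repeated here only so the block runs on its own.

```python
import numpy as np
import pandas as pd

def join_non_nan_values(elements):
    # NaN is the only value that is not equal to itself, so this skips NaN.
    return ";".join([elem for elem in elements if elem == elem])

# Shortened stand-in for the question's data (hypothetical subset of rows).
sample = pd.DataFrame({
    "Client": ["Client_2", "Client_2", "Client_4", "Client_4"],
    "Id_Card": [2, 3, 7, 8],
    "Type": ["B", "C", np.nan, "B"],
    "Loc": ["ZCW", "EWC", "PKA", "CSA"],
    "Amount": [15.0, 17.0, 38.0, -20.0],
    "Net": [42.0, -10.0, 23.0, -10.0],
})

cols_to_sum = ["Amount", "Net"]
cols_to_join = ["Type", "Loc"]

# One dictionary entry per column, generated from the lists.
agg_spec = {"Id_Card": "count"}
agg_spec.update({col: "sum" for col in cols_to_sum})
agg_spec.update({col: join_non_nan_values for col in cols_to_join})

result = sample.groupby("Client").agg(agg_spec)
print(result)
```

Adding another column to sum or join then only means appending its name to the matching list.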
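Is it possible to keep the missing values in the joined string inside the function? For example, Type for Client_4 would be NaN;B or nan;B.
Yes, we can do it like this:
";".join([elem if elem == elem else str(elem) for elem in elements])
Because NaN is a float, we have to convert it to a string.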
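As a small runnable sketch of that variant, assuming a hypothetical two-row frame and a helper name (join_keeping_nan) that does not appear in the original answer:

```python
import numpy as np
import pandas as pd

def join_keeping_nan(elements):
    # NaN fails the elem == elem check and is converted with str(),
    # which turns a float NaN into the text "nan".
    return ";".join([elem if elem == elem else str(elem) for elem in elements])

# Hypothetical two-row frame mirroring Client_4 from the question.
sample = pd.DataFrame({
    "Client": ["Client_4", "Client_4"],
    "Type": [np.nan, "B"],
})

joined = sample.groupby("Client")["Type"].agg(join_keeping_nan)
print(joined["Client_4"])  # nan;B
```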