Python equivalent of R's group_by %>% summarise in pandas

I'm trying to rewrite some code from R in Python. My df looks something like this:
import numpy as np
import pandas as pd

size = 20
np.random.seed(456)
df = pd.DataFrame({"names": np.random.choice(["bob", "alb", "jr"], size=size, replace=True),
                   "income": np.random.normal(size=size, loc=1000, scale=100),
                   "costs": np.random.normal(size=size, loc=500, scale=100),
                   "date": np.random.choice(pd.date_range("2018-01-01", "2018-01-06"),
                                            size=size, replace=True)
                   })
Now I need to group df by names and then perform several summary operations.
In R, with dplyr, I'm doing:
dfg <- group_by(df, names) %>%
  summarise(
    income.acc = sum(income),
    costs.acc = sum(costs),
    net = sum(income) - sum(costs),
    income.acc.bymax = sum(income[date==max(date)]),
    cost.acc.bymax = sum(costs[date==max(date)]),
    growth = income.acc.bymax + cost.acc.bymax - net
  )
I think you need a custom function:
def f(x):
    income_acc = x.income.sum()
    costs_acc = x.costs.sum()
    net = income_acc - costs_acc
    # Sum only the rows that fall on the group's latest date
    income_acc_bymax = x.loc[x.date == x.date.max(), 'income'].sum()
    cost_acc_bymax = x.loc[x.date == x.date.max(), 'costs'].sum()
    growth = income_acc_bymax + cost_acc_bymax - net
    c = ['income_acc','costs_acc','net','income_acc_bymax','cost_acc_bymax','growth']
    return pd.Series([income_acc, costs_acc, net, income_acc_bymax, cost_acc_bymax, growth],
                     index=c)

df1 = df.groupby('names').apply(f)
print(df1)
        income_acc    costs_acc          net  income_acc_bymax  \
names
alb    7746.653816  3605.367002  4141.286814       2785.500946
bob    6348.897809  3354.059777  2994.838032       2153.386953
jr     6205.690386  3034.601030  3171.089356        983.316234

       cost_acc_bymax       growth
names
alb       1587.685103   231.899235
bob       1215.116245   373.665167
jr         432.851030 -1754.922093
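
As an alternative to the custom function, the same summary can probably be built without `apply`. The sketch below is not from the original answer: it assumes pandas 0.25 or later (for named aggregation), and the helper columns `income_bymax` and `costs_bymax` are hypothetical names introduced only for illustration. It flags the rows on each group's latest date with `transform("max")`, zeroes out the other rows, and then sums.

```python
import numpy as np
import pandas as pd

size = 20
np.random.seed(456)
df = pd.DataFrame({
    "names": np.random.choice(["bob", "alb", "jr"], size=size, replace=True),
    "income": np.random.normal(size=size, loc=1000, scale=100),
    "costs": np.random.normal(size=size, loc=500, scale=100),
    "date": np.random.choice(pd.date_range("2018-01-01", "2018-01-06"),
                             size=size, replace=True),
})

# True for rows whose date equals the latest date within their own group.
is_max = df["date"] == df.groupby("names")["date"].transform("max")

out = (
    df.assign(
        # Hypothetical helper columns: keep the value on the latest date, else 0.
        income_bymax=df["income"].where(is_max, 0.0),
        costs_bymax=df["costs"].where(is_max, 0.0),
    )
    .groupby("names")
    .agg(
        income_acc=("income", "sum"),
        costs_acc=("costs", "sum"),
        income_acc_bymax=("income_bymax", "sum"),
        cost_acc_bymax=("costs_bymax", "sum"),
    )
)

# Derived columns are computed from the aggregated ones, as in the R code.
out["net"] = out["income_acc"] - out["costs_acc"]
out["growth"] = out["income_acc_bymax"] + out["cost_acc_bymax"] - out["net"]
print(out)
```

The column order differs from the `apply` version (`net` and `growth` come last), and whether this is faster depends on the data; it mainly avoids calling a Python function once per group.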
Now you can do this the same way as in R, using datar:

>>> from datar.all import f, group_by, summarise, sum, max
>>>
>>> dfg = group_by(df, f.names) >> summarise(
...     income_acc=sum(f.income),
...     costs_acc=sum(f.costs),
...     net=sum(f.income) - sum(f.costs),
...     income_acc_bymax=sum(f.income[f.date == max(f.date)]),
...     cost_acc_bymax=sum(f.costs[f.date == max(f.date)]),
...     growth=f.income_acc_bymax + f.cost_acc_bymax - f.net
... )
>>> dfg
  names   income_acc    costs_acc          net  income_acc_bymax  cost_acc_bymax       growth
0   alb  7746.653816  3605.367002  4141.286814       2785.500946     1587.685103   231.899235
1   bob  6348.897809  3354.059777  2994.838032       2153.386953     1215.116245   373.665167
2    jr  6205.690386  3034.601030  3171.089356        983.316234      432.851030 -1754.922093
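I am the author of this package. Feel free to submit an issue if you have any questions.

If you use np.random.seed(456) before creating the DataFrame, what is your expected output?

I've now edited my question to be more precise.

Thanks. Can you add the expected output from the sample data?

I modified the answer slightly in case there are multiple maximum dates per group.

Could you check the output of the R code? I don't have RStudio, so I can't run it.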
income_acc costs_acc net income_acc_bymax \
names
alb 7997.466538 3996.053670 4001.412868 2997.855009
bob 6003.488978 3003.540598 2999.948380 2001.533870
jr 6002.056904 3000.346010 3001.710894 999.833162
cost_acc_bymax growth
names
alb 1500.876851 497.318992
bob 1002.151162 3.736652
jr 499.328510 -1502.549221
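
One of the comments above raises the case where several rows in a group share the maximum date. The small example below, which uses made-up data that is not from the thread, illustrates how the boolean mask `x.date == x.date.max()` behaves then: it sums every tied row, which matches `sum(income[date==max(date)])` in the R code. Picking a single row with `idxmax` is shown only for contrast; the helper names `income_on_latest` and `income_first_latest` are hypothetical.

```python
import pandas as pd

# Made-up data: "bob" has two rows on his latest date (2018-01-02).
df = pd.DataFrame({
    "names": ["bob", "bob", "bob", "jr"],
    "income": [100.0, 200.0, 300.0, 50.0],
    "date": pd.to_datetime(["2018-01-01", "2018-01-02",
                            "2018-01-02", "2018-01-01"]),
})

def income_on_latest(x):
    # Boolean mask: keeps every row whose date equals the group's maximum.
    return float(x.loc[x["date"] == x["date"].max(), "income"].sum())

def income_first_latest(x):
    # idxmax returns the label of the first row holding the maximum date,
    # so tied rows after the first one are ignored.
    return float(x.loc[x["date"].idxmax(), "income"])

all_ties = {name: income_on_latest(g) for name, g in df.groupby("names")}
first_only = {name: income_first_latest(g) for name, g in df.groupby("names")}

print(all_ties)    # {'bob': 500.0, 'jr': 50.0}
print(first_only)  # {'bob': 200.0, 'jr': 50.0}
```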