Pandas: how to sum the last X values across two columns based on a condition
I recently started learning pandas. I really tried to find a solution but couldn't. Here is the problem: I have a data frame of simple football data. For each team, I want to know how many goals they scored in their previous two matches, regardless of whether they played at home or away. So for each team I have to sum a specific number of values taken from two different columns. Sample data:
import pandas as pd
data = [['2018-02-03', 'manutd', 'chelsea', 3, 1], ['2018-02-08', 'arsenal', 'liverpool', 1, 1],
['2018-01-12', 'chelsea', 'westham', 2, 0], ['2018-01-12', 'liverpool', 'manutd', 0, 2],
['2018-03-15', 'arsenal', 'chelsea', 2, 2], ['2018-02-20', 'manutd', 'brighton', 0, 0],
['2018-04-01', 'westham', 'fulham', 1, 0], ['2018-03-15', 'manutd', 'westham', 2, 1]]
df = pd.DataFrame(data, columns = ['event_time', 'home_team', 'away_team', 'home_goals', 'away_goals'])
df['event_time'] = pd.to_datetime(df['event_time'])
df.sort_values(['event_time'],inplace=True, ascending=False)
print(df)
  event_time  home_team  away_team  home_goals  away_goals
6 2018-04-01    westham     fulham           1           0
4 2018-03-15    arsenal    chelsea           2           2
7 2018-03-15     manutd    westham           2           1
5 2018-02-20     manutd   brighton           0           0
1 2018-02-08    arsenal  liverpool           1           1
0 2018-02-03     manutd    chelsea           3           1
2 2018-01-12    chelsea    westham           2           0
3 2018-01-12  liverpool     manutd           0           2
What I want to achieve:
  event_time  home_team  away_team  home_goals  away_goals  h_goals_previous_2  a_goals_previous_2
6 2018-04-01    westham     fulham           1           0                   1                 NaN
4 2018-03-15    arsenal    chelsea           2           2                   1                   3
7 2018-03-15     manutd    westham           2           1                   3                   0
5 2018-02-20     manutd   brighton           0           0                   5                 NaN
1 2018-02-08    arsenal  liverpool           1           1                 NaN                   0
0 2018-02-03     manutd    chelsea           3           1                   2                   2
2 2018-01-12    chelsea    westham           2           0                 NaN                 NaN
3 2018-01-12  liverpool     manutd           0           2                 NaN                 NaN
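Explanation:
- On 2018-03-15 Arsenal played Chelsea. In their previous two matches Chelsea scored 3 goals in total: 1 away and 2 at home.
- Some of the previous-goals values are NaN because we have no data for earlier matches.
I tried to do this by iterating team by team: for each team I build a sorted subset of df, on which the values could then be aggregated. But I feel this is not the best solution, and that it could be done with a neat expression: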
teams = pd.unique(df[['home_team', 'away_team']].values.ravel('K'))
for team in teams:
    print(team)
    # All matches of this team, home or away, newest first
    team_df = df[(df['home_team'] == team) | (df['away_team'] == team)]
    team_df = team_df.sort_values('event_time', ascending=False)
    print(team_df)
How can I do this without writing loops and ifs?

Method 1:
# Build df2: move event_time into the index, then rename the remaining
# columns to team1/team2/goals1/goals2 so that pd.wide_to_long can be applied
df2 = df.set_index('event_time', append=True)
df2.columns = [''.join(name[::-1]) for name in df2.columns.str.split('_')]
df2.columns = df2.columns.str.replace('home', '1').str.replace('away', '2')
df2 = df2.reset_index()
# Reshape to one row per (match, team) with pd.wide_to_long
df_long = (pd.wide_to_long(df2, ['team', 'goals'], i='level_0', j='key')
             .sort_values('event_time', ascending=False))
print(df_long)
             event_time       team  goals
level_0 key
6       1    2018-04-01    westham      1
        2    2018-04-01     fulham      0
4       1    2018-03-15    arsenal      2
7       1    2018-03-15     manutd      2
4       2    2018-03-15    chelsea      2
7       2    2018-03-15    westham      1
5       1    2018-02-20     manutd      0
        2    2018-02-20   brighton      0
1       1    2018-02-08    arsenal      1
        2    2018-02-08  liverpool      1
0       1    2018-02-03     manutd      3
        2    2018-02-03    chelsea      1
2       1    2018-01-12    chelsea      2
3       1    2018-01-12  liverpool      0
2       2    2018-01-12    westham      0
3       2    2018-01-12     manutd      2
  event_time  home_team  away_team  home_goals  away_goals  h_goals_previous_2  a_goals_previous_2
6 2018-04-01    westham     fulham           1           0                 1.0                 NaN
4 2018-03-15    arsenal    chelsea           2           2                 NaN                 3.0
7 2018-03-15     manutd    westham           2           1                 3.0                 NaN
5 2018-02-20     manutd   brighton           0           0                 5.0                 NaN
1 2018-02-08    arsenal  liverpool           1           1                 NaN                 NaN
0 2018-02-03     manutd    chelsea           3           1                 NaN                 NaN
2 2018-01-12    chelsea    westham           2           0                 NaN                 NaN
3 2018-01-12  liverpool     manutd           0           2                 NaN                 NaN
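The listing for Method 1 stops at df_long and does not show how the two new columns are produced. The following is a minimal sketch of one way that step could look, not necessarily what the answer actually did. It assumes the per-team sum is the `shift(-1) + shift(-2)` expression quoted in the comments. For self-containment it builds the long table with `pd.concat` instead of `pd.wide_to_long`; the names `long`, `side`, `match` and `prev2` are illustrative only.

```python
import pandas as pd

data = [['2018-02-03', 'manutd', 'chelsea', 3, 1], ['2018-02-08', 'arsenal', 'liverpool', 1, 1],
        ['2018-01-12', 'chelsea', 'westham', 2, 0], ['2018-01-12', 'liverpool', 'manutd', 0, 2],
        ['2018-03-15', 'arsenal', 'chelsea', 2, 2], ['2018-02-20', 'manutd', 'brighton', 0, 0],
        ['2018-04-01', 'westham', 'fulham', 1, 0], ['2018-03-15', 'manutd', 'westham', 2, 1]]
df = pd.DataFrame(data, columns=['event_time', 'home_team', 'away_team', 'home_goals', 'away_goals'])
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values('event_time', ascending=False)

# Long format: one row per (match, team); 'side' records home ('h') or away ('a')
cols = ['event_time', 'team', 'goals']
home = df.rename(columns={'home_team': 'team', 'home_goals': 'goals'})[cols].assign(side='h')
away = df.rename(columns={'away_team': 'team', 'away_goals': 'goals'})[cols].assign(side='a')
long = pd.concat([home, away]).rename_axis('match').reset_index()
long = long.sort_values('event_time', ascending=False).reset_index(drop=True)

# Rows are newest first, so shift(-1) and shift(-2) pick up the same team's
# previous and second-previous match; the sum is NaN if either is missing
groups_goals = long.groupby('team')['goals']
long['prev2'] = groups_goals.shift(-1) + groups_goals.shift(-2)

# Back to one row per match; assignment aligns on the original row labels
prev = long.pivot(index='match', columns='side', values='prev2')
df['h_goals_previous_2'] = prev['h']
df['a_goals_previous_2'] = prev['a']
print(df)
```

Because the sum requires both earlier matches to exist, a team with only one earlier match gets NaN, which is why this output has more NaN values than the table in the question.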
Method 2:
Output:
  event_time  home_team  away_team  home_goals  away_goals  h_goals_previous_2  a_goals_previous_2
6 2018-04-01    westham     fulham           1           0                 1.0                 NaN
4 2018-03-15    arsenal    chelsea           2           2                 NaN                 3.0
7 2018-03-15     manutd    westham           2           1                 3.0                 NaN
5 2018-02-20     manutd   brighton           0           0                 5.0                 NaN
1 2018-02-08    arsenal  liverpool           1           1                 NaN                 NaN
0 2018-02-03     manutd    chelsea           3           1                 NaN                 NaN
2 2018-01-12    chelsea    westham           2           0                 NaN                 NaN
3 2018-01-12  liverpool     manutd           0           2                 NaN                 NaN
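Note that there are more NaN values here because I only used the rows shown in the data frame.

Comments:

Can you explain what you mean by "regardless of whether they played at home or away"? If that is true, why do you have two goals_previous_2 columns, one for home and one for away? Also, if you want faster feedback, I suggest you fill in both columns in the expected output.

The output has been updated. I want to compute, for both the home team and the away team of each match, how many goals they scored in their previous two matches. Explanation of the calculation: on 2018-03-15 Arsenal played Chelsea. In their previous two matches Chelsea scored 3 goals in total: 1 away and 2 at home. Some of the previous-goals values are NaN because we have no data for earlier matches.

Thanks! This works and I learned a lot. I have a question about parameterizing it for N previous matches, for example the last 8 on a large dataset. What can be used instead of value_2_sum = groups_goals.shift(-1) + groups_goals.shift(-2)? I tried groups_goals.shift(-8).rolling(8).sum(), but could not get it to work on the grouped series. If I use shift and rolling directly on df['goals'], the sums are correct (but that is not what I want to achieve); after applying it to groups_goals I get "weird" results, possibly an index problem.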
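One possible way to generalize this to N previous matches is sketched below, as an illustration rather than the answerer's method. A likely cause of the "weird" results is that `shift` on a grouped series returns an ordinary Series, so a `.rolling(...)` chained after it is no longer applied per team and its windows run across team boundaries. The sketch keeps both steps inside one `transform` per team. The helper name `add_previous_goals` and its parameters are hypothetical, and the table is sorted oldest first so that `shift(1)` excludes the current match.

```python
import pandas as pd

def add_previous_goals(df, n, min_periods=None):
    """Hypothetical helper: add per-match sums of each team's goals over its
    previous n matches, home or away. min_periods=None requires all n matches."""
    cols = ['event_time', 'team', 'goals']
    home = df.rename(columns={'home_team': 'team', 'home_goals': 'goals'})[cols].assign(side='h')
    away = df.rename(columns={'away_team': 'team', 'away_goals': 'goals'})[cols].assign(side='a')
    long = pd.concat([home, away]).rename_axis('match').reset_index()
    long = long.sort_values('event_time').reset_index(drop=True)  # oldest first

    # shift(1) drops the current match; rolling(n) then sums the n matches before it
    long['prev'] = long.groupby('team')['goals'].transform(
        lambda s: s.shift(1).rolling(n, min_periods=min_periods).sum())

    prev = long.pivot(index='match', columns='side', values='prev')
    out = df.copy()
    out[f'h_goals_previous_{n}'] = prev['h']
    out[f'a_goals_previous_{n}'] = prev['a']
    return out

data = [['2018-02-03', 'manutd', 'chelsea', 3, 1], ['2018-02-08', 'arsenal', 'liverpool', 1, 1],
        ['2018-01-12', 'chelsea', 'westham', 2, 0], ['2018-01-12', 'liverpool', 'manutd', 0, 2],
        ['2018-03-15', 'arsenal', 'chelsea', 2, 2], ['2018-02-20', 'manutd', 'brighton', 0, 0],
        ['2018-04-01', 'westham', 'fulham', 1, 0], ['2018-03-15', 'manutd', 'westham', 2, 1]]
df = pd.DataFrame(data, columns=['event_time', 'home_team', 'away_team', 'home_goals', 'away_goals'])
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values('event_time', ascending=False)

strict = add_previous_goals(df, n=2)                  # NaN unless 2 earlier matches exist
relaxed = add_previous_goals(df, n=2, min_periods=1)  # sums whatever earlier matches exist
print(relaxed)
```

On the sample data, `min_periods=1` gives the table asked for in the question, while the default gives the stricter output with more NaN values. For the last 8 matches, pass `n=8`. If a team can play twice on the same date, the sort would need an additional tie-breaking column.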