Python: How to best extract information from multiple dataframes based on a series of if/else conditions and matching values? (Guidance needed!)
Tags: python, pandas, dataframe, where-clause
I have three dataframes: X, Y and Events. df_X has the X coordinates, df_Y has the Y coordinates, and Events_df has a list of events that occurred. The data relates to basketball. You can see how they are linked by looking at the following:
df_Event:
Seconds Passed    Event Type           Player
1.0               Passed The Ball      Steve
2.0               Received Pass        Michael
3.0               Touch                Michael
4.0               Passed The Ball      Michael
5.0               Received The Ball    George
df_X:
Seconds Passed    Steve    Michael    George
1.0               11.43    12.33      15.33
2.0               11.45    12.46      13.22
3.0               10.99    10.33      14.33
4.0               11.34    10.36      11.22
5.0               12.43    12.22      11.78
df_Y:
....
(The Same As Above Just With Different Numbers)
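I want to record patterns of events over time and then fetch the X, Y coordinates that correspond to the Seconds Passed column in each dataframe. For example, if I want to know where a pass started and ended, I need the following information.
I would like this information in a new dataframe labelled Passes_df:
Passing Player    Receiving Player    X Coordinate PP    Y Coordinate PP    X Coordinate RP    Y Coordinate RP
Steve             Michael             11.43              ....               12.46              .....
I know I can use something like the following: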
Passes_df['Passing Player'] = df_Event['Player'].where(df_Event['Event'] == 'Pass').dropna()
Passes_df['Receiving Player'] = df_Event['Player'].shift(-1).where\
((df_Event['Event'] == 'Pass') & (df_Event['Event'].shift(-1) == 'Received Pass'))
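However, this seems too verbose. Is there a function I can use that pulls the information from each source more fluently? Any help would be much appreciated.

Your question is missing some description: you are using an 'Event' column and matching on 'Pass', while the column name and values in your data are different. Using two shift operations is probably not advisable here, but using one may be fine, though: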
Passes_df = df_Event.copy()
# Index the coordinate frames by second so .at can return a single scalar
x_by_sec = df_X.set_index('Seconds Passed')
y_by_sec = df_Y.set_index('Seconds Passed')
Passes_df['X_Coordinate_PP'] = Passes_df.apply(lambda r: x_by_sec.at[r['Seconds Passed'], r['Player']], axis=1)
Passes_df['Y_Coordinate_PP'] = Passes_df.apply(lambda r: y_by_sec.at[r['Seconds Passed'], r['Player']], axis=1)
# A plain str has no .contains(); use the vectorised .str accessor instead
is_pass = Passes_df['Event Type'].str.startswith('Passed')
is_recv = Passes_df['Event Type'].str.startswith('Received')
Passes_df['Passing Player'] = Passes_df['Player'].where(is_pass)
# The receiver is the player on the next row, provided that row is a reception
Passes_df['Receiving Player'] = Passes_df['Player'].shift(-1).where(is_pass & is_recv.shift(-1, fill_value=False))
Passes_df['X_Coordinate_RP'] = Passes_df['X_Coordinate_PP'].shift(-1)
Passes_df['Y_Coordinate_RP'] = Passes_df['Y_Coordinate_PP'].shift(-1)
Passes_df.drop(columns=['Player'], inplace=True)
Passes_df.dropna(inplace=True)
Let me know if this helps.

You can use pandas.pivot for this.
Assuming the data is sorted by seconds:
df_Event['event_order'] = df_Event.groupby('Event Type').cumcount()
df_Event['X'] = df_Event.merge(df_X, on='Seconds Passed').apply(lambda x: x[x['Player']], axis=1)
df_Event['Y'] = df_Event.merge(df_Y, on='Seconds Passed').apply(lambda x: x[x['Player']], axis=1)
df = df_Event.pivot(index='event_order', columns='Event Type', values=['Player', 'X', 'Y'])
To flatten the column index:
df.columns = ['_'.join(col) for col in df.columns]

Assuming:
df_X:
Seconds Steve Michael George
0 1.0 11.43 12.33 15.33
1 2.0 11.45 12.46 13.22
2 3.0 10.99 10.33 14.33
3 4.0 11.34 10.36 11.22
4 5.0 12.43 12.22 11.78
df_e:
Seconds Event Player
0 1.0 Passed Steve
1 2.0 Received Michael
2 3.0 Touch Michael
3 4.0 Passed Michael
4 5.0 Received George
df_Y is similar to df_X.
First stack df_X and df_Y and join them with df_e to get all the information in a single df:
df_X = df_X.set_index('Seconds').stack().rename_axis(['Seconds', 'Player']).rename('X')
df_Y = df_Y.set_index('Seconds').stack().rename_axis(['Seconds', 'Player']).rename('Y')
df_e = df_e.set_index(['Seconds', 'Player'])
df_e = df_e.join(df_X).join(df_Y).reset_index(level='Player')
df_e:
Player Event X Y
Seconds
1.0 Steve Passed 11.43 11.43
2.0 Michael Received 12.46 12.46
3.0 Michael Touch 10.33 10.33
4.0 Michael Passed 10.36 10.36
5.0 George Received 11.78 11.78
Now select only the pass-related events, i.e. Passed and Received.
Join each event with the one that follows it:
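The code for this selection step does not appear in this copy of the answer. A minimal sketch of what it could look like, assuming the joined frame is the `df_e` printed above and that `df_pe` (the name used in the next step) is just the filtered subset. Using `isin` here is my assumption, not necessarily what the answer's author wrote:

```python
import pandas as pd

# Rebuild the joined frame shown above (values copied from the printed output).
df_e = pd.DataFrame(
    {
        "Player": ["Steve", "Michael", "Michael", "Michael", "George"],
        "Event": ["Passed", "Received", "Touch", "Passed", "Received"],
        "X": [11.43, 12.46, 10.33, 10.36, 11.78],
        "Y": [11.43, 12.46, 10.33, 10.36, 11.78],
    },
    index=pd.Index([1.0, 2.0, 3.0, 4.0, 5.0], name="Seconds"),
)

# Keep only the pass-related rows; the "Touch" row is dropped.
df_pe = df_e[df_e["Event"].isin(["Passed", "Received"])]
print(df_pe)
```

With the sample data this leaves the rows at seconds 1, 2, 4 and 5, which is what the shift-and-join below expects.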
df_pe = df_pe.join(df_pe.shift(-1), rsuffix='_1')
Then keep only the passes that were 'Received':
df_passes = df_pe[df_pe.Event_1 == 'Received']
Player Event X Y Player_1 Event_1 X_1 Y_1
Seconds
1.0 Steve Passed 11.43 11.43 Michael Received 12.46 12.46
4.0 Michael Passed 10.36 10.36 George Received 11.78 11.78
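
Solving this problem requires a systematic approach, and that approach would change significantly if the understanding of the problem changes. Because the output dataframe in the question excludes the event type 'Touch' and only pairs passes with receptions, I took the following approach to reach that output:
1. The X and Y coordinate dataframes are not tidy. We need to make them tidy with the pd.melt function.
2. Merge the event, X coordinate and Y coordinate data into a single dataframe with the pd.merge function.
3. Create separate dataframes for passes and receptions.
4. Because 'Seconds Passed' is the only unique column, I assume a 1-second delay between a pass and its reception, so subtract 1 second from the reception dataframe.
5. Merge the pass dataframe with the reception dataframe.
As a convention, I use pd for pandas.
Step 1: Get the data into tidy form
Step 2: Merge the event, X coordinate and Y coordinate data into a single dataframe
Now you have comprehensive event data, including coordinates and players.
Step 3: Create separate dataframes for passes and receptions
Finally: Merge the pass dataframe with the reception dataframe
Result/output: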
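The code for these steps is not included in this copy of the answer, so here is a rough sketch of how the melt/merge approach could be written. The frame and column names follow the question; the Y values are made up for illustration (the question does not list them), and the intermediate names `x_long`, `y_long`, `events`, `passes` and `receives` are mine. It also relies on the answer's assumption that a pass is always received exactly one second later:

```python
import pandas as pd

df_Event = pd.DataFrame({
    "Seconds Passed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Event Type": ["Passed The Ball", "Received Pass", "Touch",
                   "Passed The Ball", "Received The Ball"],
    "Player": ["Steve", "Michael", "Michael", "Michael", "George"],
})
df_X = pd.DataFrame({
    "Seconds Passed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Steve": [11.43, 11.45, 10.99, 11.34, 12.43],
    "Michael": [12.33, 12.46, 10.33, 10.36, 12.22],
    "George": [15.33, 13.22, 14.33, 11.22, 11.78],
})
# Y values are illustrative only; the question does not show them.
df_Y = pd.DataFrame({
    "Seconds Passed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Steve": [13.43, 13.45, 12.99, 12.34, 12.00],
    "Michael": [9.33, 9.46, 10.53, 12.36, 12.72],
    "George": [10.33, 10.22, 10.33, 11.72, 12.78],
})

# Step 1: tidy form, one row per (second, player).
x_long = df_X.melt(id_vars="Seconds Passed", var_name="Player", value_name="X")
y_long = df_Y.melt(id_vars="Seconds Passed", var_name="Player", value_name="Y")

# Step 2: attach the acting player's coordinates to each event.
keys = ["Seconds Passed", "Player"]
events = df_Event.merge(x_long, on=keys, how="left").merge(y_long, on=keys, how="left")

# Step 3: separate frames for passes and receptions.
passes = events[events["Event Type"].str.startswith("Passed")]
receives = events[events["Event Type"].str.startswith("Received")].copy()

# Assumed 1-second delay: move each reception back one second
# so it lines up with the pass that produced it.
receives["Seconds Passed"] = receives["Seconds Passed"] - 1

# Final step: pair each pass with its reception.
passes_df = passes.merge(receives, on="Seconds Passed", suffixes=(" PP", " RP"))
passes_df = passes_df.rename(columns={"Player PP": "Passing Player",
                                      "Player RP": "Receiving Player"})
passes_df = passes_df[["Passing Player", "Receiving Player",
                       "X PP", "Y PP", "X RP", "Y RP"]]
print(passes_df)
```

If the gap between a pass and its reception is not always one second, the fixed subtraction would no longer line the rows up, and pairing each pass with the next reception row would be a safer choice.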
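
If you want to know how the ball moves, including dribbling, you can do the following:
0: I simplified the input a bit:
1: Merge X and Y into tuples of coordinates. Of course you don't have to use tuples; you can also use the solution given by @ckedar and adjust the keys accordingly later: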
df_merged = df_x.merge(df_y, on=['seconds']).set_index('seconds')
df_merged=df_merged.groupby(df_merged.columns.str.split('_').str[0],axis=1).agg(lambda x : tuple(x.tolist()))
Result:
George Michael Steve
seconds
1 (15.33, 10.33) (12.33, 9.33) (11.43, 13.43)
2 (13.22, 10.22) (12.46, 9.46) (11.45, 13.45)
3 (14.33, 10.33) (10.33, 10.53) (10.99, 12.99)
4 (11.22, 11.72) (10.36, 12.36) (11.34, 12.34)
5 (11.78, 12.78) (12.22, 12.72) (12.43, 12.0)
2: Put the coordinates into the events dataframe:
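The code for this step is missing here. One way it could be done, sketched under the assumption that step 1 produced a frame `df_merged` indexed by `seconds` with one tuple-valued column per player, and that the simplified events frame `df_e` has the columns `seconds`, `event` and `player` (names taken from the printed results):

```python
import pandas as pd

seconds = [1, 2, 3, 4, 5]
# Tuple-valued frame as printed in step 1.
df_merged = pd.DataFrame(
    {
        "George": [(15.33, 10.33), (13.22, 10.22), (14.33, 10.33), (11.22, 11.72), (11.78, 12.78)],
        "Michael": [(12.33, 9.33), (12.46, 9.46), (10.33, 10.53), (10.36, 12.36), (12.22, 12.72)],
        "Steve": [(11.43, 13.43), (11.45, 13.45), (10.99, 12.99), (11.34, 12.34), (12.43, 12.0)],
    },
    index=pd.Index(seconds, name="seconds"),
)
df_e = pd.DataFrame({
    "seconds": seconds,
    "event": ["passed", "received", "touch", "passed", "received"],
    "player": ["Steve", "Michael", "Michael", "Michael", "George"],
})

# For each event, pick the tuple at (row = second, column = acting player).
df_e["coord"] = [df_merged.at[s, p] for s, p in zip(df_e["seconds"], df_e["player"])]
print(df_e)
```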
Result:
seconds event player coord
0 1 passed Steve (11.43, 13.43)
1 2 received Michael (12.46, 9.46)
2 3 touch Michael (10.33, 10.53)
3 4 passed Michael (10.36, 12.36)
4 5 received George (11.78, 12.78)
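3: Create a new df with the from and to information:
Here, if you use separate X and Y columns instead of tuples for the coordinates, you have to adjust the keys accordingly.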
df_new = pd.DataFrame(columns=['second', 'from', 'to', 'from_coord', 'to_coord'])
df_new[['second', 'from', 'from_coord']] = df_e[['seconds', 'player', 'coord']].iloc[:-1]
df_new[['to', 'to_coord']] = df_e[['player', 'coord']].iloc[1:].reset_index().drop('index',axis=1)
df_new = df_new.set_index('second')
Result:
from to from_coord to_coord
second
1 Steve Michael (11.43, 13.43) (12.46, 9.46)
2 Michael Michael (12.46, 9.46) (10.33, 10.53)
3 Michael Michael (10.33, 10.53) (10.36, 12.36)
4 Michael George (10.36, 12.36) (11.78, 12.78)
4 (optional): Now you can separate passes and dribbling:
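The code for this split is also missing. A small sketch, assuming `df_new` looks like the from/to table from step 3 and reading the printed results as "a pass is a row where the ball changes player, dribbling is a row where it stays with the same player". The names `same_player`, `df_passes` and `df_dribbling` are mine:

```python
import pandas as pd

# from/to table as produced in step 3 (values copied from the printed result).
df_new = pd.DataFrame(
    {
        "from": ["Steve", "Michael", "Michael", "Michael"],
        "to": ["Michael", "Michael", "Michael", "George"],
        "from_coord": [(11.43, 13.43), (12.46, 9.46), (10.33, 10.53), (10.36, 12.36)],
        "to_coord": [(12.46, 9.46), (10.33, 10.53), (10.36, 12.36), (11.78, 12.78)],
    },
    index=pd.Index([1, 2, 3, 4], name="second"),
)

# True where the ball stays with the same player between two events.
same_player = df_new["from"] == df_new["to"]

df_passes = df_new[~same_player]     # ball changes hands
df_dribbling = df_new[same_player]   # ball stays with one player

print("Passes:")
print(df_passes)
print("Dribbling:")
print(df_dribbling)
```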
Result:
Passes:
from to from_coord to_coord
second
1 Steve Michael (11.43, 13.43) (12.46, 9.46)
4 Michael George (10.36, 12.36) (11.78, 12.78)
Dribbling:
from to from_coord to_coord
second
2 Michael Michael (12.46, 9.46) (10.33, 10.53)
3 Michael Michael (10.33, 10.53) (10.36, 12.36)
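
Comments:
Thanks. What if I want to record the X and Y coordinates between 'Passed' and 'Received'? How would you do that? Sometimes I want to look at a series of events and fetch the coordinates for the seconds in between. Looking at your example, how would I get the coordinates between seconds 1 and 4?
@Sam If you want to keep the order of events as shown in df_Event, then use 'df' instead of 'df_passes' in the last step, with a left join. If you like my answer, please mark it as accepted.
@HussainMansoor You reduce the seconds of the received rows by 1. Is that because in the example both passes are received in the following second? What if the interval between pass and reception is longer?
@ckedar Real-life data can have many problems, including the one you mention, and they will affect the approach.
@HussainMansoor Yes, that's it. Is there a way to do this, for example when a player dribbles? I would like to extract that information.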