Python: How do I best extract information from multiple dataframes based on a series of if/else conditions and matching values? (Guidance needed!)

Tags: python, pandas, dataframe, where-clause

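I have three dataframes: X, Y and events. df_X holds the X coordinates, df_Y holds the Y coordinates, and df_Event holds a list of events that occurred. The data relates to basketball. You can see how they tie together from the following: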

df_Event:

Seconds Passed   Event Type         Player
1.0              Passed The Ball    Steve
2.0              Received Pass      Michael
3.0              Touch              Michael
4.0              Passed The Ball    Michael
5.0              Received The Ball  George


df_X:

Seconds Passed  Steve   Michael   George
1.0             11.43   12.33     15.33
2.0             11.45   12.46     13.22  
3.0             10.99   10.33     14.33           
4.0             11.34   10.36     11.22
5.0             12.43   12.22     11.78


df_Y:

....

(The Same As Above Just With Different Numbers)
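I want to record patterns of events over time and then fetch the X and Y coordinates that match on the Seconds Passed column across each dataframe. For example, if I want to know where a pass started and where it ended, I need the following information.

I want a new dataframe named Passes_df that contains this information: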

Passing Player   Receiving Player    X Coordinate PP   Y Coordinate PP  X Coordinate RP   Y Coordinate RP
Steve            Michael             11.43             ....             12.46             .....
I know I can use the following approach:

Passes_df['Passing Player'] = df_Event['Player'].where(df_Event['Event'] == 'Pass').dropna()
Passes_df['Receiving Player'] = df_Event['Player'].shift(-1).where\
((df_Event['Event'] == 'Pass') & (df_Event['Event'].shift(-1) == 'Received Pass'))

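However, this seems too verbose. Is there a function I could use that pulls the information from each source more fluently? Any help would be much appreciated.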

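Your question is missing some detail: your code uses an 'Event' column and matches on 'Pass', while the column name and values in your data are different. Using two shift operations is probably not advisable here, but a single one may be fine: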

Passes_df = df_Event.copy()

# Index the coordinate tables by time so each lookup is (second, player) -> value
x_at = df_X.set_index('Seconds Passed')
y_at = df_Y.set_index('Seconds Passed')
Passes_df['X_Coordinate_PP'] = Passes_df.apply(lambda r: x_at.at[r['Seconds Passed'], r['Player']], axis=1)
Passes_df['Y_Coordinate_PP'] = Passes_df.apply(lambda r: y_at.at[r['Seconds Passed'], r['Player']], axis=1)

# The passer is the player on a 'Passed ...' row; the receiver is the player on the next row, if that is a 'Received ...' row
Passes_df['Passing Player'] = Passes_df['Player'].where(Passes_df['Event Type'].str.contains('Passed'))
Passes_df['Receiving Player'] = Passes_df['Player'].where(Passes_df['Event Type'].str.contains('Received')).shift(-1)

# The receiver's coordinates also sit on the next row
Passes_df['X_Coordinate_RP'] = Passes_df['X_Coordinate_PP'].shift(-1)
Passes_df['Y_Coordinate_RP'] = Passes_df['Y_Coordinate_PP'].shift(-1)

Passes_df.drop(columns=['Player'], inplace=True)
Passes_df.dropna(inplace=True)
Let me know if this helps.
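For anyone who wants to try this idea end to end, here is a minimal self-contained sketch on the question's sample data. It is not the answerer's code: the `lookup` helper is a hypothetical name, the Y values are assumed (the question elides df_Y), and a pass is taken to be a 'Passed ...' row immediately followed by a 'Received ...' row.

```python
import pandas as pd

seconds = [1.0, 2.0, 3.0, 4.0, 5.0]
df_Event = pd.DataFrame({
    "Seconds Passed": seconds,
    "Event Type": ["Passed The Ball", "Received Pass", "Touch",
                   "Passed The Ball", "Received The Ball"],
    "Player": ["Steve", "Michael", "Michael", "Michael", "George"],
})
df_X = pd.DataFrame({
    "Seconds Passed": seconds,
    "Steve": [11.43, 11.45, 10.99, 11.34, 12.43],
    "Michael": [12.33, 12.46, 10.33, 10.36, 12.22],
    "George": [15.33, 13.22, 14.33, 11.22, 11.78],
})
# Assumed Y values: the question does not show df_Y
df_Y = pd.DataFrame({
    "Seconds Passed": seconds,
    "Steve": [13.43, 13.45, 12.99, 12.34, 12.0],
    "Michael": [9.33, 9.46, 10.53, 12.36, 12.72],
    "George": [10.33, 10.22, 10.33, 11.72, 12.78],
})


def lookup(coords, events):
    """For each event row, return the coordinate of that row's player at that second."""
    table = coords.set_index("Seconds Passed")
    values = [table.at[sec, player]
              for sec, player in zip(events["Seconds Passed"], events["Player"])]
    return pd.Series(values, index=events.index)


events = df_Event.assign(X=lookup(df_X, df_Event), Y=lookup(df_Y, df_Event))
nxt = events.shift(-1)  # the following event, aligned with the current row

# A pass: a 'Passed ...' row whose next row is a 'Received ...' row
is_pass = (events["Event Type"].str.startswith("Passed")
           & nxt["Event Type"].str.startswith("Received", na=False))

Passes_df = pd.DataFrame({
    "Passing Player": events.loc[is_pass, "Player"],
    "Receiving Player": nxt.loc[is_pass, "Player"],
    "X Coordinate PP": events.loc[is_pass, "X"],
    "Y Coordinate PP": events.loc[is_pass, "Y"],
    "X Coordinate RP": nxt.loc[is_pass, "X"],
    "Y Coordinate RP": nxt.loc[is_pass, "Y"],
}).reset_index(drop=True)
print(Passes_df)
```

Shifting the whole events frame once keeps the "next row" logic in one place, so the receiver's name and coordinates cannot drift out of step with each other.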

You can use pandas.pivot for this.

Assuming it is sorted by seconds:

df_Event['Event_order'] = df_Event.groupby('Event Type').cumcount()
df_Event['X'] = df_Event.merge(df_X, on='Seconds Passed').apply(lambda x: x[x['Player']], axis=1)
df_Event['Y'] = df_Event.merge(df_Y, on='Seconds Passed').apply(lambda x: x[x['Player']], axis=1)
df = df_Event.pivot(index='Event_order', columns='Event Type', values=['Player', 'X', 'Y'])

To flatten the column index:

df.columns = list(map(lambda x: '_'.join(x), df.columns))

Assuming

df_X:
   Seconds  Steve  Michael  George
0      1.0  11.43    12.33   15.33
1      2.0  11.45    12.46   13.22
2      3.0  10.99    10.33   14.33
3      4.0  11.34    10.36   11.22
4      5.0  12.43    12.22   11.78

df_e:
   Seconds     Event   Player
0      1.0    Passed    Steve
1      2.0  Received  Michael
2      3.0     Touch  Michael
3      4.0    Passed  Michael
4      5.0  Received   George
df_Y is similar to df_X.

First stack df_X and df_Y and join them with df_e to get all the information into a single df:

df_X = df_X.set_index('Seconds').stack().rename_axis(['Seconds', 'Player']).rename('X')
df_Y = df_Y.set_index('Seconds').stack().rename_axis(['Seconds', 'Player']).rename('Y')
df_e = df_e.set_index(['Seconds', 'Player'])
df_e = df_e.join(df_X).join(df_Y).reset_index(level='Player')

df_e:
          Player     Event      X      Y
Seconds                                 
1.0        Steve    Passed  11.43  11.43
2.0      Michael  Received  12.46  12.46
3.0      Michael     Touch  10.33  10.33
4.0      Michael    Passed  10.36  10.36
5.0       George  Received  11.78  11.78

Now select only the pass-related events, i.e. Passed and Received.

Join consecutive events:

df_pe = df_pe.join(df_pe.shift(-1), rsuffix='_1')
Then keep only the passes that were 'Received':

df_passes = df_pe[df_pe.Event_1 == 'Received']

          Player   Event      X      Y Player_1   Event_1    X_1    Y_1
Seconds                                                                
1.0        Steve  Passed  11.43  11.43  Michael  Received  12.46  12.46
4.0      Michael  Passed  10.36  10.36   George  Received  11.78  11.78
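The whole pipeline can be run on the sample frames as below. This is a sketch under two assumptions: df_Y is taken to be a copy of df_X (the output above shows Y equal to X), and the selection that defines df_pe, for which no code appears above, is assumed to be an `isin` filter on the Event column.

```python
import pandas as pd

seconds = [1.0, 2.0, 3.0, 4.0, 5.0]
df_X = pd.DataFrame({
    "Seconds": seconds,
    "Steve": [11.43, 11.45, 10.99, 11.34, 12.43],
    "Michael": [12.33, 12.46, 10.33, 10.36, 12.22],
    "George": [15.33, 13.22, 14.33, 11.22, 11.78],
})
df_Y = df_X.copy()  # assumption: same numbers as df_X, as in the output above
df_e = pd.DataFrame({
    "Seconds": seconds,
    "Event": ["Passed", "Received", "Touch", "Passed", "Received"],
    "Player": ["Steve", "Michael", "Michael", "Michael", "George"],
})

# Reshape the wide coordinate tables to one value per (Seconds, Player)
long_X = df_X.set_index("Seconds").stack().rename_axis(["Seconds", "Player"]).rename("X")
long_Y = df_Y.set_index("Seconds").stack().rename_axis(["Seconds", "Player"]).rename("Y")

# Attach the coordinates of the acting player to every event
events = (df_e.set_index(["Seconds", "Player"])
              .join(long_X)
              .join(long_Y)
              .reset_index(level="Player"))

# Assumed selection step: keep only pass-related events
df_pe = events[events["Event"].isin(["Passed", "Received"])]

# Put each event side by side with the one that follows it
df_pe = df_pe.join(df_pe.shift(-1), rsuffix="_1")

# Keep the rows whose following event is a reception
df_passes = df_pe[df_pe["Event_1"] == "Received"]
print(df_passes)
```

Because the Touch row is filtered out before the shift, a pass is still paired with its reception even when other events happen in between.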

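Solving this needs a systematic approach, and the approach would change significantly if the understanding of the problem changed. In the question as posed, the output dataframe excludes the 'Touch' event type and only pairs passes with receptions, so I took an approach that arrives at that output.

1. The X and Y coordinate dataframes are not tidy. Make them tidy with pd.melt.
2. Merge the event, X coordinate and Y coordinate data into a single dataframe with pd.merge.
3. Create separate dataframes for passes and for receptions.
4. Because 'Seconds Passed' is the only unique column, I assume there is a 1-second delay between a pass and its reception, so subtract 1 second from the receive dataframe.
5. Merge the pass dataframe and the receive dataframe.

As is conventional, I use pd for pandas.

Step 1: Get the data into tidy form.
Step 2: Merge the event, X coordinate and Y coordinate data into a single dataframe. You now have comprehensive event data, including coordinates and players.
Step 3: Create separate dataframes for passes and receptions.
Finally: merge the pass dataframe with the receive dataframe.
Result/output: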
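The steps above could look roughly like this. It is a sketch rather than the answerer's original code: the frame names, the " PP"/" RP" suffixes and the Y values are assumptions, and it relies on the stated 1-second delay between a pass and its reception.

```python
import pandas as pd

seconds = [1.0, 2.0, 3.0, 4.0, 5.0]
df_Event = pd.DataFrame({
    "Seconds Passed": seconds,
    "Event Type": ["Passed The Ball", "Received Pass", "Touch",
                   "Passed The Ball", "Received The Ball"],
    "Player": ["Steve", "Michael", "Michael", "Michael", "George"],
})
df_X = pd.DataFrame({
    "Seconds Passed": seconds,
    "Steve": [11.43, 11.45, 10.99, 11.34, 12.43],
    "Michael": [12.33, 12.46, 10.33, 10.36, 12.22],
    "George": [15.33, 13.22, 14.33, 11.22, 11.78],
})
# Assumed Y values: the question does not show df_Y
df_Y = pd.DataFrame({
    "Seconds Passed": seconds,
    "Steve": [13.43, 13.45, 12.99, 12.34, 12.0],
    "Michael": [9.33, 9.46, 10.53, 12.36, 12.72],
    "George": [10.33, 10.22, 10.33, 11.72, 12.78],
})

# Step 1: tidy form, one row per (second, player)
tidy_X = df_X.melt(id_vars="Seconds Passed", var_name="Player", value_name="X")
tidy_Y = df_Y.melt(id_vars="Seconds Passed", var_name="Player", value_name="Y")

# Step 2: one dataframe with events, players and coordinates
keys = ["Seconds Passed", "Player"]
events = df_Event.merge(tidy_X, on=keys).merge(tidy_Y, on=keys)

# Step 3: separate passes and receptions
passes = events[events["Event Type"].str.startswith("Passed")]
receives = events[events["Event Type"].str.startswith("Received")].copy()

# Assumed 1-second delay: move each reception back onto the second of its pass
receives["Seconds Passed"] -= 1

# Finally: pair each pass with the reception that now shares its timestamp
Passes_df = passes.merge(receives, on="Seconds Passed", suffixes=(" PP", " RP"))
print(Passes_df[["Player PP", "Player RP", "X PP", "Y PP", "X RP", "Y RP"]])
```

If receptions can arrive more than one second after the pass, the fixed subtraction no longer lines the rows up, and pairing each pass with the next reception in event order is the safer choice.
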
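If you want to know how the ball moves, including dribbling, you can do the following:

0: I simplified the input a bit:
1: Merge X and Y into tuples of coordinates:
Of course, you don't have to use tuples; you could also use the solution given by @ckedar and adjust the keys accordingly later on.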

df_merged = df_x.merge(df_y, on=['seconds']).set_index('seconds')
df_merged=df_merged.groupby(df_merged.columns.str.split('_').str[0],axis=1).agg(lambda x : tuple(x.tolist()))
Result:

                 George         Michael           Steve
seconds                                                
1        (15.33, 10.33)   (12.33, 9.33)  (11.43, 13.43)
2        (13.22, 10.22)   (12.46, 9.46)  (11.45, 13.45)
3        (14.33, 10.33)  (10.33, 10.53)  (10.99, 12.99)
4        (11.22, 11.72)  (10.36, 12.36)  (11.34, 12.34)
5        (11.78, 12.78)  (12.22, 12.72)   (12.43, 12.0)
2: Put the coordinates into the event dataframe:
Result:

   seconds     event   player           coord
0        1    passed    Steve  (11.43, 13.43)
1        2  received  Michael   (12.46, 9.46)
2        3     touch  Michael  (10.33, 10.53)
3        4    passed  Michael  (10.36, 12.36)
4        5  received   George  (11.78, 12.78)
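3: Create a new df with the from and to information:
Here, if you use X and Y instead of tuples for the coordinates, you have to adjust to different keys.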

df_new = pd.DataFrame(columns=['second', 'from', 'to', 'from_coord', 'to_coord'])
df_new[['second', 'from', 'from_coord']] = df_e[['seconds', 'player', 'coord']].iloc[:-1]
df_new[['to', 'to_coord']] = df_e[['player', 'coord']].iloc[1:].reset_index().drop('index',axis=1)
df_new = df_new.set_index('second')
Result:

           from       to      from_coord        to_coord
second                                                  
1         Steve  Michael  (11.43, 13.43)   (12.46, 9.46)
2       Michael  Michael   (12.46, 9.46)  (10.33, 10.53)
3       Michael  Michael  (10.33, 10.53)  (10.36, 12.36)
4       Michael   George  (10.36, 12.36)  (11.78, 12.78)
4 (optional): Now you can get the passes and the dribbling:
Result:

Passes:
           from       to      from_coord        to_coord
second                                                  
1         Steve  Michael  (11.43, 13.43)   (12.46, 9.46)
4       Michael   George  (10.36, 12.36)  (11.78, 12.78)

Dribbling:
           from       to      from_coord        to_coord
second                                                  
2       Michael  Michael   (12.46, 9.46)  (10.33, 10.53)
3       Michael  Michael  (10.33, 10.53)  (10.36, 12.36)
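Steps 0 to 4 can be put together as below. This is a sketch that reproduces the tables shown, not necessarily the answerer's exact code: the coordinates are combined with melt and zip rather than a column-wise groupby, the from/to frame is built with shift, and the rule for step 4 (a pass when the ball changes player, dribbling when it stays with the same player) is an assumption read off the results.

```python
import pandas as pd

# 0: simplified input
seconds = [1, 2, 3, 4, 5]
df_x = pd.DataFrame({
    "seconds": seconds,
    "Steve": [11.43, 11.45, 10.99, 11.34, 12.43],
    "Michael": [12.33, 12.46, 10.33, 10.36, 12.22],
    "George": [15.33, 13.22, 14.33, 11.22, 11.78],
})
df_y = pd.DataFrame({
    "seconds": seconds,
    "Steve": [13.43, 13.45, 12.99, 12.34, 12.0],
    "Michael": [9.33, 9.46, 10.53, 12.36, 12.72],
    "George": [10.33, 10.22, 10.33, 11.72, 12.78],
})
df_e = pd.DataFrame({
    "seconds": seconds,
    "event": ["passed", "received", "touch", "passed", "received"],
    "player": ["Steve", "Michael", "Michael", "Michael", "George"],
})

# 1: one (x, y) tuple per second and player, in long form
x_long = df_x.melt(id_vars="seconds", var_name="player", value_name="x")
y_long = df_y.melt(id_vars="seconds", var_name="player", value_name="y")
coords = x_long.merge(y_long, on=["seconds", "player"])
coords["coord"] = list(zip(coords["x"].tolist(), coords["y"].tolist()))

# 2: attach the acting player's coordinate to each event
df_e = df_e.merge(coords[["seconds", "player", "coord"]], on=["seconds", "player"])

# 3: pair every event with the one that follows it
nxt = df_e.shift(-1)
df_new = pd.DataFrame({
    "second": df_e["seconds"],
    "from": df_e["player"],
    "to": nxt["player"],
    "from_coord": df_e["coord"],
    "to_coord": nxt["coord"],
}).iloc[:-1].set_index("second")  # the last event has no successor

# 4 (assumed rule): the ball changes player -> pass, stays -> dribbling
passes = df_new[df_new["from"] != df_new["to"]]
dribbling = df_new[df_new["from"] == df_new["to"]]
print(passes)
print(dribbling)
```

Keeping the coordinates in long form avoids grouping columns by name prefix, which depends on the merge suffixes and on column-wise groupby support in the installed pandas version.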

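Thanks. What if I wanted to record the X and Y coordinates between 'Passed' and 'Received'? How would you do that? Sometimes I want to look at a series of events and fetch the coordinates between the seconds passed. Looking at your example, how would I get the coordinates between seconds 1 and 4?

@Sam If you want to keep the event order shown in df_Event, then in the last step use 'df' instead of 'df_passes', with a left-join argument. If you like my answer, please mark it as accepted.

@HussainMansoor You reduce the seconds in the receive df by 1. Is that because in the example both passes are received in the next second? What if the gap between passing and receiving is longer?

@ckedar Real-life data can have many issues, including the one you mention, and they would affect the approach.

@HussainMansoor Yes, that's it. Is there a way to do this, for example, when a player dribbles? I would like to extract that information.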