Python pandas: speed of multi-condition lookups
I'm working with some historical baseball data and trying to get matchup information (batter/pitcher) for previous games. Sample data:
import pandas as pd
data = {'ID': ['A','A','A','A','A','A','B','B','B','B','B'],
'Year' : ['2017-05-01', '2017-06-03', '2017-08-02', '2018-05-30', '2018-07-23', '2018-09-14', '2017-06-01', '2017-08-03', '2018-05-15', '2018-07-23', '2017-05-01'],
'ID2' : [1,2,3,2,2,1,2,2,2,1,1],
'Score 2': [1,4,5,7,5,5,6,1,4,5,6],
'Score 3': [1,4,5,7,5,5,6,1,4,5,6],
'Score 4': [1,4,5,7,5,5,6,1,4,5,6]}
df = pd.DataFrame(data)
lookup_data = {"First_Person" : ['A', 'B'],
"Second_Person" : ['1', '2'],
"Year" : ['2018', '2018']}
lookup_df = pd.DataFrame(lookup_data)
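lookup_df holds the current matchups; df holds the historical data as well as the current matchups.
I want to find out, for example for person A and person 2, what their matchup results were on any earlier date.
I can do it like this: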
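history_list = []
def get_history(row, df, hist_list):
    # Filter df to earlier matchups between both players and sum all events in their history;
    # iloc[3:] drops the summed ID, Year and ID2 columns and keeps only the scores.
    # ID2 is stored as int in df but as str in lookup_df, so cast before comparing.
    history = df[(df['ID'] == row['First_Person']) & (df['ID2'] == int(row['Second_Person'])) & (df['Year'] < row['Year'])].sum().iloc[3:]
    # Add to a list to keep track of results, tagged with a date + players key
    hist_list.append(list(history.values) + [row['Year'] + row['First_Person'] + row['Second_Person']])
# Run once per lookup row (this call is assumed; the post does not show it)
lookup_df.apply(get_history, axis=1, args=(df, history_list))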
The expected result looks like this:
1st person  Matchup date  2nd person  Historical scores
A           2018-07-23    2           11  11  11
B           2018-05-15    2            7   7   7
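But this is quite slow: the filtering step takes about 50 ms per lookup.
Is there a better way to solve this? At the moment it takes more than 3 hours to work through 250,000 historical matchups.

You can merge (or map) and then groupby: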
lookup_df['Second_Person'] = lookup_df['Second_Person'].astype(int)
merged = df.merge(lookup_df, left_on = ['ID', 'ID2'], right_on = ['First_Person', 'Second_Person'], how = 'left').query('Year_x < Year_y').drop(['Year_x', 'First_Person', 'Second_Person', 'Year_y'], axis = 1)
merged.groupby('ID', as_index = False).sum()
  ID  ID2  Score 2  Score 3  Score 4
0  A    1        1        1        1
1  B    4        7        7        7
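The merge above compares against the year string and groups by ID only, so ID2 ends up summed as well. Below is a minimal sketch of the same merge-then-groupby idea, adapted to full dates and grouped per player pair. It assumes the lookup rows carry the matchup dates shown in the question's expected result (A vs 2 on 2018-07-23, B vs 2 on 2018-05-15); the names `score_cols`, `earlier`, and `history` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Year': ['2017-05-01', '2017-06-03', '2017-08-02', '2018-05-30',
             '2018-07-23', '2018-09-14', '2017-06-01', '2017-08-03',
             '2018-05-15', '2018-07-23', '2017-05-01'],
    'ID2': [1, 2, 3, 2, 2, 1, 2, 2, 2, 1, 1],
    'Score 2': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
    'Score 3': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
    'Score 4': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
})
df['Year'] = pd.to_datetime(df['Year'])

# Assumed lookup table: integer ID2 values and full matchup dates,
# taken from the expected result in the question.
lookup_df = pd.DataFrame({
    'First_Person': ['A', 'B'],
    'Second_Person': [2, 2],
    'Year': pd.to_datetime(['2018-07-23', '2018-05-15']),
})

score_cols = ['Score 2', 'Score 3', 'Score 4']

# One merge pairs every lookup row with all games of the same two players.
merged = lookup_df.merge(
    df,
    left_on=['First_Person', 'Second_Person'],
    right_on=['ID', 'ID2'],
    how='inner',
    suffixes=('_lookup', '_hist'),
)

# Keep only games strictly before the matchup date, then total per matchup.
earlier = merged[merged['Year_hist'] < merged['Year_lookup']]
history = earlier.groupby(
    ['First_Person', 'Second_Person', 'Year_lookup'], as_index=False
)[score_cols].sum()

print(history)
```

Lookup rows with no earlier games drop out of `history`; merging `history` back onto `lookup_df` with `how='left'` and filling the gaps with 0 would restore them if that is needed.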
Sorry, I realize I need to clarify this better. I need any date before the current date, not just the year. I've updated my original post to show this.
There is a date in the last row of your df.
@ctd25, I don't understand how you get 11 for ID A. Only one row with ID A and ID2 1 matches the lookup.
Yes, I was in a hurry and mixed that up :) But I was able to test this and it works very well! I just used .query(date_x …
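Since df is described as holding the current matchups alongside the historical ones, another option is to skip the lookup table and compute the prior totals for every row in one pass with a grouped cumulative sum. This is only a sketch of that alternative, not something proposed in the thread. It assumes each player pair meets at most once per date (otherwise same-day games would count toward each other), and the `Prior ` column prefix is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Year': ['2017-05-01', '2017-06-03', '2017-08-02', '2018-05-30',
             '2018-07-23', '2018-09-14', '2017-06-01', '2017-08-03',
             '2018-05-15', '2018-07-23', '2017-05-01'],
    'ID2': [1, 2, 3, 2, 2, 1, 2, 2, 2, 1, 1],
    'Score 2': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
    'Score 3': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
    'Score 4': [1, 4, 5, 7, 5, 5, 6, 1, 4, 5, 6],
})
df['Year'] = pd.to_datetime(df['Year'])
score_cols = ['Score 2', 'Score 3', 'Score 4']

# Order each pair's games chronologically so the running total is meaningful.
df = df.sort_values(['ID', 'ID2', 'Year'])

# Running total within each pair, minus the current game's own scores,
# leaves the sum over all strictly earlier games of that pair.
prior = df.groupby(['ID', 'ID2'])[score_cols].cumsum() - df[score_cols]

# Attach the prior totals to the identifying columns (aligned on the index).
out = df[['ID', 'Year', 'ID2']].join(prior.add_prefix('Prior '))
print(out)
```

Every game then carries its own history, so fetching the totals for a given matchup is a plain row selection rather than a filter-and-sum per lookup.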