Python: rows to time-series columns
I have been analyzing PGA Tour data. For machine-learning purposes, I want the columns to represent statistics across several weeks. Below is an example of the original data structure:
import pandas as pd
import numpy as np
data = {'Player Name': ['Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger', 'Jack',
                        'Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack'],
        'Date': [1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
        'SG Total': [13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1]}
df_original = pd.DataFrame(data)
I would like to get the data into the following format:
data = {'Player Name': ['Tiger', 'Tiger', 'Tiger', 'Jack', 'Jack',
                        'Jack', 'Jack'],
        'Date': [6, 7, 9, 6, 9, 10, 11],
        'SG Total (Date t-3)': [13, 2, 14, 1, 3, 8, 4],
        'SG Total (Date t-2)': [2, 14, 6, 3, 8, 4, 9],
        'SG Total (Date t-1)': [14, 6, 8, 8, 4, 9, 2],
        'SG Total (Date y)': [6, 8, 1, 4, 9, 2, 1]}
df_correct = pd.DataFrame(data)
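In the real dataset I am working with, I have about 1,000 columns, so the desired dataset could have about 4,000 columns. As you can see in the desired dataset, I dropped the first 3 weeks for each player. Because each player's first 3 weeks are used to fill in (t-3), (t-2), and (t-1), the dates start from that player's 4th week.
I initially created a dataset for every week, regardless of whether the player played, and used this code to create the desired dataframe: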
#%% Creates weekly dataframes & predictions dataframes
# Creates dataframes of each week
dict_of_weeks = {}
for i in range(1, df_numeric_combined['Date'].nunique() + 1):
    dict_of_weeks['Week_{}_df'.format(i)] = df_numeric_combined[df_numeric_combined['Date'] == i]
    dict_of_weeks['Week_{}_df'.format(i)].columns += ' (Week ' + str(i) + ')'
    dict_of_weeks['Week_{}_df'.format(i)].rename(columns={'Player Name (Week ' + str(i) + ')': 'Player Name', 'Date (Week ' + str(i) + ')': 'Date'}, inplace=True)

# Creating dataframes for prediction of each week
import functools
dict_of_predictions = {}
df_weeks = []
for i in range(4, df_numeric_combined['Date'].nunique() + 1):
    dfs = [dict_of_weeks['Week_' + str(i-3) + '_df'], dict_of_weeks['Week_' + str(i-2) + '_df'], dict_of_weeks['Week_' + str(i-1) + '_df'], dict_of_weeks['Week_' + str(i) + '_df']]
    dict_of_predictions['Week_{}_predictions'.format(i)] = functools.reduce(lambda left, right: pd.merge(left, right, on=['Player Name'], how='outer'), dfs)
    cols = []
    count = 1
    for column in dict_of_predictions['Week_{}_predictions'.format(i)].columns:
        if column == 'Date_y':
            cols.append('Date_y_' + str(count))
            count += 1
            continue
        cols.append(column)
    dict_of_predictions['Week_{}_predictions'.format(i)].columns = cols
    dict_of_predictions['Week_{}_predictions'.format(i)].drop(columns=['Date_x', 'Date_y_1'], inplace=True)
    dict_of_predictions['Week_{}_predictions'.format(i)].rename(columns={'Date_y_2': 'Date'}, inplace=True)
    dict_of_predictions['Week_{}_predictions'.format(i)].columns = dict_of_predictions['Week_{}_predictions'.format(i)].columns.str.replace('(Week ' + str(i-3) + ')', 'Week t-3').str.replace('(Week ' + str(i-2) + ')', 'Week t-2').str.replace('(Week ' + str(i-1) + ')', 'Week t-1').str.replace('(Week ' + str(i) + ')', 'Week y')
    df_weeks.append(dict_of_predictions['Week_{}_predictions'.format(i)])

# Combines predictions dataframes
df = pd.concat(dict_of_predictions.values(), axis=0, join='inner')
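However, this code only works if every player played in consecutive weeks, because it relies on the week number and subtracts 3, 2, and 1 from it.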
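To make the limitation concrete, here is a small illustrative sketch (not part of the code above) that uses Tiger's rows from the sample data. For each week from 4 onward, it lists which of the weeks `week - 3`, `week - 2`, and `week - 1` Tiger did not play. The name `missing_by_week` is introduced here only for illustration.

```python
import pandas as pd

# Tiger's rows from the sample data: he did not play in weeks 3, 5 and 8.
tiger = pd.DataFrame({'Date': [1, 2, 4, 6, 7, 9],
                      'SG Total': [13, 2, 14, 6, 8, 1]})

played = set(tiger['Date'].tolist())

# For each target week, find the calendar weeks the week-number
# approach would look up but that have no row for this player.
missing_by_week = {}
for week in tiger['Date'].tolist():
    if week < 4:
        continue  # the weekly approach starts at week 4
    needed = [week - 3, week - 2, week - 1]
    missing_by_week[week] = [w for w in needed if w not in played]

print(missing_by_week)
```

Every target week has at least one missing lookup, so the lag has to be counted in the player's own appearances rather than in calendar weeks.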
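The ultimate goal is to get the data into the format of df_correct.
Thanks.

If I understand your request correctly, you can use groupby with shift on the sorted dataframe to fill in each player's results from previous weeks: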
## First sort by player and date
df_corrected = df_original.sort_values(['Player Name', 'Date'])
your_columns = ['SG Total']  ## list your 4000 columns here
for col in your_columns:
    for s in [3, 2, 1, 0]:  ## time shifts
        df_corrected[f'{col} (Date t-{s})'] = df_corrected.groupby('Player Name')[col].shift(s)
df_corrected.drop(your_columns, axis=1, inplace=True)
which outputs:
Out[12]:
    Player Name  Date  SG Total (Date t-3)  SG Total (Date t-2)  \
6          Jack     1                  NaN                  NaN
7          Jack     3                  NaN                  NaN
8          Jack     4                  NaN                  1.0
9          Jack     6                  1.0                  3.0
10         Jack     9                  3.0                  8.0
11         Jack    10                  8.0                  4.0
12         Jack    11                  4.0                  9.0
0         Tiger     1                  NaN                  NaN
1         Tiger     2                  NaN                  NaN
2         Tiger     4                  NaN                 13.0
3         Tiger     6                 13.0                  2.0
4         Tiger     7                  2.0                 14.0
5         Tiger     9                 14.0                  6.0

    SG Total (Date t-1)  SG Total (Date t-0)
6                   NaN                    1
7                   1.0                    3
8                   3.0                    8
9                   8.0                    4
10                  4.0                    9
11                  9.0                    2
12                  2.0                    1
0                   NaN                   13
1                  13.0                    2
2                   2.0                   14
3                  14.0                    6
4                   6.0                    8
5                   8.0                    1
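
The output above still contains each player's first three rows (which have NaN lags) and labels the current week "t-0" rather than "y" as in df_correct. A minimal sketch of one way to close that gap follows, assuming the same sample data. The helper `add_lags` and its parameters (`stat_cols`, `n_lags`, `key`, `date`) are hypothetical names introduced here, not part of pandas. It shifts all stat columns at once per lag and joins the pieces with a single `pd.concat`, which may be preferable to adding columns one at a time when there are around 1,000 stats. Rows are filtered with `cumcount` rather than `dropna`, so genuine missing values in the data would not cause extra rows to be dropped.

```python
import pandas as pd

# Sample data copied from the question.
df_original = pd.DataFrame({
    'Player Name': ['Tiger'] * 6 + ['Jack'] * 7,
    'Date': [1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
    'SG Total': [13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1],
})


def add_lags(df, stat_cols, n_lags=3, key='Player Name', date='Date'):
    """Hypothetical helper: lag stats by each player's previous appearances."""
    df = df.sort_values([key, date])
    grouped = df.groupby(key, sort=False)

    pieces = [df[[key, date]]]
    for s in range(n_lags, -1, -1):
        label = 'y' if s == 0 else f't-{s}'
        # Shift every stat column by s rows within each player's group.
        shifted = grouped[stat_cols].shift(s)
        shifted.columns = [f'{c} (Date {label})' for c in stat_cols]
        pieces.append(shifted)
    out = pd.concat(pieces, axis=1)

    # Keep only rows that have a full history of n_lags earlier appearances.
    has_history = grouped.cumcount() >= n_lags
    return out[has_history].reset_index(drop=True)


df_lagged = add_lags(df_original, ['SG Total'])
print(df_lagged)
```

The rows come out sorted by player name, so Jack appears before Tiger; apart from row order, the values should correspond to df_correct from the question.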