Python: creating sliding-window groups from a DataFrame
I am trying to preprocess data for an ML regression problem.
Using the following (simplified) DataFrame:
grp day score
0 A 1 2
1 A 1 4
2 A 2 6
3 A 2 8
4 A 3 10
5 A 3 12
6 A 4 14
7 A 4 16
8 A 5 18
9 A 5 20
10 B 1 2
11 B 2 4
12 B 3 8
13 B 4 16
14 B 5 32
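I am trying to create a list of "sliding window" sequences based on the day column, so that for a window of X days, the target is the score Y days ahead.
In the example below, I have 5 days in each group; for every 2 days I look at the target 2 days ahead, and I stop when I reach the end of the DataFrame.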
For example, here are the first two sequences for group A:
grp day score target
0 A 1 2 16
1 A 1 4 16
2 A 2 6 16
3 A 2 8 16 <- last score value of day 4 (group A)
grp day score target
0 A 2 6 20
1 A 2 8 20
2 A 3 10 20
3 A 3 12 20 <- last score value of day 5 (group A)
But I am a bit lost... any help would be appreciated.
Update:
I have written the following clumsy code, which gets me going... How can I improve it?
import pandas as pd
df = pd.DataFrame({'grp':['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B'],
                   'day':['1','1','2','2','3','3','4','4','5','5','1','2','3','4','5'],
                   'score':[2,4,6,8,10,12,14,16,18,20,2,4,8,16,32]
                   })
print(df.head(15))
df2 = pd.DataFrame({'grp':[],
                    'day':[],
                    'score':[]})
groups = df.groupby(['grp'])
GROUP_SIZE = 2
LOOK_AHEAD = 2
sequences = []
for _,grp in groups:
    days_row_index = grp['day'].factorize()[0]
    days_group = grp.groupby(days_row_index)
    for _,day in days_group:
        day_index = int(day['day'].values[0])
        if day_index + LOOK_AHEAD < len(days_group):
            target = days_group.get_group(day_index + LOOK_AHEAD)['score'].values[-1]
            print(day_index,day_index + LOOK_AHEAD,day['score'].values[-1],"----------->",target)
            day['target'] = target
            df2 = pd.concat([df2,day])
            for i in range(0, GROUP_SIZE-1):
                if day_index + i >= len(days_group):
                    break
                next_day = days_group.get_group(day_index + i)
                next_day['target'] = target
                df2 = pd.concat([df2,next_day])
            sequences.append(df2.copy())
            df2 = df2.iloc[0:0]
sequences
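Building on your proposed solution, I wrote this code. I am sure it can be optimized, so anyone is welcome to improve it. Let me know if this is what you wanted (I took the liberty of creating another "mixed" group, 'C', to test a more general approach).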
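That was a typo, thanks, and sorry for deleting the comment; I realized it was an obvious typo, so it did not really need a comment. Anyway, I have another question: the target for group B, day 2 (score 4) should be 32, right? Never mind, I now understand why it is 16 rather than 32.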
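To make the 16-versus-32 point concrete, it may help to list which days each window covers and which day supplies its target. The following is a minimal sketch, assuming a window of 2 days and a target taken 2 days after the window's last day, as in the expected output of the question. The helper name `window_plan` and its parameters are hypothetical and only serve this illustration.

```python
import pandas as pd

# Group B from the question: one row per day.
b = pd.DataFrame({"day": [1, 2, 3, 4, 5], "score": [2, 4, 8, 16, 32]})

def window_plan(days, window=2, look_ahead=2):
    """Return (window_days, target_day) pairs. Hypothetical helper."""
    days = sorted(int(d) for d in set(days))
    plan = []
    # Stop once the target day would fall past the last available day.
    for start in range(len(days) - window - look_ahead + 1):
        window_days = days[start:start + window]
        target_day = days[start + window - 1 + look_ahead]
        plan.append((window_days, target_day))
    return plan

# Last score recorded on each day (a day may have several rows in other groups).
last_score = b.groupby("day")["score"].last()

for window_days, target_day in window_plan(b["day"]):
    print(window_days, "-> target day", target_day, "score", last_score[target_day])
```

Under these assumptions, day 2 of group B appears in two windows: in the window covering days 1 and 2 its target comes from day 4 (score 16), and in the window covering days 2 and 3 its target comes from day 5 (score 32).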
import pandas as pd
# Create test dataframe
df = [
    ['A', 1, 2],
    ['A', 1, 4],
    ['A', 2, 6],
    ['A', 2, 8],
    ['A', 3, 10],
    ['A', 3, 12],
    ['A', 4, 14],
    ['A', 4, 16],
    ['A', 5, 18],
    ['A', 5, 20],
    ['B', 1, 2],
    ['B', 2, 4],
    ['B', 3, 8],
    ['B', 4, 16],
    ['B', 5, 32],
    ['C', 1, 2],
    ['C', 1, 4],
    ['C', 2, 8],
    ['C', 3, 16],
    ['C', 3, 20],
    ['C', 4, 24],
    ['C', 5, 28]
]
df = pd.DataFrame(df, columns = ['grp', 'day', 'score'])
# Processing
groups = df.groupby(['grp'])
for _,grp in groups:
    days_row_index = grp['day'].factorize()[0]
    i = min(days_row_index)
    while i < max(days_row_index) - 2:
        idx = (days_row_index == i) | (days_row_index == i + 1)
        # Create list of targets for every subgroup
        print([grp['score'].values[days_row_index == i + 3][-1]]*sum(idx))
        i += 1
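The loop above only prints the list of targets for each window. One way the same indexing could be packaged to return the actual sequences, each carrying a `target` column, is sketched below. The function name `make_sequences` and its `window` and `look_ahead` parameters are assumptions made for illustration and are not part of the original answer. The sketch also assumes that rows are already sorted by day within each group, and that `look_ahead` counts days past the last day of the window, which matches the expected output in the question.

```python
import pandas as pd

def make_sequences(df, window=2, look_ahead=2):
    """Build sliding windows over the distinct days of each group.

    Hypothetical helper: each window spans `window` consecutive days, and
    its target is the last score of the day that lies `look_ahead` days
    after the window's final day.
    """
    sequences = []
    for _, grp in df.groupby("grp", sort=False):
        # 0-based position of each row's day among the group's distinct days.
        day_pos = pd.factorize(grp["day"])[0]
        n_days = day_pos.max() + 1
        # An empty range means the group has too few days for any window.
        for start in range(n_days - window - look_ahead + 1):
            target_pos = start + window - 1 + look_ahead
            in_window = (day_pos >= start) & (day_pos < start + window)
            seq = grp[in_window].copy()  # copy so the source frame is untouched
            seq["target"] = grp["score"].to_numpy()[day_pos == target_pos][-1]
            sequences.append(seq.reset_index(drop=True))
    return sequences

df = pd.DataFrame({
    "grp": ["A"] * 10 + ["B"] * 5,
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3, 4, 5],
    "score": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 4, 8, 16, 32],
})

for seq in make_sequences(df):
    print(seq, end="\n\n")
```

Collecting each window with a boolean mask and a single `copy()` avoids growing a DataFrame with repeated `pd.concat` calls, which is the part of the question's code that tends to be slow on larger inputs.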