Python: creating sliding-window groups from a DataFrame

Tags: python, pandas

I am trying to preprocess data for an ML regression problem.
I am using the following (simplified) DataFrame:

   grp day  score
0    A   1      2
1    A   1      4
2    A   2      6
3    A   2      8
4    A   3     10
5    A   3     12
6    A   4     14
7    A   4     16
8    A   5     18
9    A   5     20
10   B   1      2
11   B   2      4
12   B   3      8
13   B   4     16
14   B   5     32
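I am trying to build a list of "sliding window" sequences based on the day column, so that for a window of X days the target is the score Y from 2 days ahead.

In the example below, each group has 5 days. For every 2-day window I look up the target 2 days ahead, and I stop when I reach the end of the DataFrame.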

For example, here are the first two sets for group A:

   grp day  score   target
0    A   1      2    16
1    A   1      4    16
2    A   2      6    16
3    A   2      8    16 <- last score value of day 4 (group A)

   grp day  score   target
0    A   2      6    20
1    A   2      8    20
2    A   3     10    20
3    A   3     12    20 <- last score value of day 5 (group A)
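But I'm a bit lost... Any help would be greatly appreciated.

Update:

I have written the following clumsy code, which gets me going... How can I improve it?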

import pandas as pd
df = pd.DataFrame({'grp':['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B'],
                   'day':['1','1','2','2','3','3','4','4','5','5','1','2','3','4','5'],
                   'score':[2,4,6,8,10,12,14,16,18,20,2,4,8,16,32]
                   })

print(df.head(15))

df2 = pd.DataFrame({'grp':[],
                    'day':[],
                    'score':[]})

groups = df.groupby(['grp'])
GROUP_SIZE = 2
LOOK_AHEAD = 2
sequences = []

for _,grp in groups:
  days_row_index = grp['day'].factorize()[0]
  days_group = grp.groupby(days_row_index)
  for _,day in days_group:
    day_index = int(day['day'].values[0])
    if day_index + LOOK_AHEAD < len(days_group):
      target = days_group.get_group(day_index + LOOK_AHEAD)['score'].values[-1]
      print(day_index,day_index + LOOK_AHEAD,day['score'].values[-1],"----------->",target)
      day['target'] = target
      df2 = pd.concat([df2,day])
      for i in range(0, GROUP_SIZE-1):
        if day_index + i >= len(days_group):
          break
        next_day = days_group.get_group(day_index + i)
        next_day['target'] = target
        df2 = pd.concat([df2,next_day])
      sequences.append(df2.copy())
      df2 = df2.iloc[0:0]
sequences

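Building on your proposed solution, I wrote the code below. I'm sure it can be optimized, so anyone is welcome to improve it. Let me know if this is what you were looking for (I took the liberty of creating another, "mixed" group "C" to test a more general approach).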

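That was a typo, thanks, and sorry for deleting the comment; I realized it was an obvious typo, so it didn't really need a comment. Anyway, I have another question: the target for group B, day 2, score 4 should be 32, right? Never mind, I now understand why it is 16 and not 32.
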
import pandas as pd

# Create test dataframe
df = [
     ['A', 1, 2],
     ['A', 1, 4],
     ['A', 2, 6],
     ['A', 2, 8],
     ['A', 3, 10],
     ['A', 3, 12],
     ['A', 4, 14],
     ['A', 4, 16],
     ['A', 5, 18],
     ['A', 5, 20],
     ['B', 1, 2],
     ['B', 2, 4],
     ['B', 3, 8],
     ['B', 4, 16],
     ['B', 5, 32],
     ['C', 1, 2],
     ['C', 1, 4],
     ['C', 2, 8],
     ['C', 3, 16],
     ['C', 3, 20],
     ['C', 4, 24],
     ['C', 5, 28]
     ]
df = pd.DataFrame(df, columns = ['grp', 'day', 'score'])

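# Processing
groups = df.groupby(['grp'])
for _, grp in groups:
    # 0-based position of each row's day within its group
    days_row_index = grp['day'].factorize()[0]
    i = min(days_row_index)
    # Stop once the target day (window start + 3) would fall past the last day
    while i < max(days_row_index) - 2:
        # Rows belonging to the 2-day window that starts at day position i
        idx = (days_row_index == i) | (days_row_index == i + 1)
        # Create list of targets for every subgroup: the last score of the
        # target day, repeated once per row in the window
        print([grp['score'].values[days_row_index == i + 3][-1]]*sum(idx))
        i += 1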
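
The loop above only prints the targets. As a rough sketch of how the same idea could return the windows themselves, the function below collects one DataFrame per window with a constant target column. The name sliding_windows and the parameters window_days and look_ahead are made up for this illustration and are not from the question or the answer. It assumes the day values sort in calendar order (integers, as in the answer's test data) and that the target is the last score of the day look_ahead days after the window's final day, which is how the expected output in the question reads.

```python
import pandas as pd


def sliding_windows(df, window_days=2, look_ahead=2,
                    group_col="grp", day_col="day", score_col="score"):
    """Return a list of DataFrames, one per window, each with a 'target' column.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    windows = []
    # Distance (in day positions) from a window's first day to its target day.
    offset = window_days - 1 + look_ahead
    for _, grp in df.groupby(group_col, sort=False):
        # 0-based rank of each row's day among the group's distinct days.
        day_pos = pd.factorize(grp[day_col], sort=True)[0]
        n_days = day_pos.max() + 1
        # range() is empty when the group has too few days for a full window.
        for start in range(n_days - offset):
            in_window = (day_pos >= start) & (day_pos < start + window_days)
            # Last score recorded on the target day.
            target = grp.loc[day_pos == start + offset, score_col].iloc[-1]
            # assign() returns a new frame, so the source data is not modified.
            windows.append(grp[in_window].assign(target=target))
    return windows


df = pd.DataFrame({
    "grp": list("AAAAAAAAAA") + list("BBBBB"),
    "day": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3, 4, 5],
    "score": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 4, 8, 16, 32],
})

windows = sliding_windows(df)
for w in windows:
    print(w, end="\n\n")
```

With the data from the question this yields four windows: two for group A (targets 16 and 20) and two for group B (targets 16 and 32). Day 2 of group B appears in both of that group's windows, which is why it can be paired with either 16 or 32 depending on the window.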