A shorter way to create a new column in groupby() based on the previous n rows
I have the following code which, for a sorted Pandas data frame, groups by one column and creates two new columns: one built from the previous 4 rows in the group plus the current row, the other from the next row in the group.
import pandas as pd

data_test = {'nr': [1,1,1,1,1,6,6,6,6,6,6,6],
             'val': [11,12,13,14,15,61,62,63,64,65,66,67]}
df_test = pd.DataFrame(data_test, columns=['nr', 'val'])
print(df_test)

# one helper column per lag; rows without enough history get 0
df_test['past4'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(4).fillna(0))
df_test['past3'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(3).fillna(0))
df_test['past2'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(2).fillna(0))
df_test['past1'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(1).fillna(0))
# next value within the group, 0 for the last row of each group
df_test['future'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(-1).fillna(0))
# collect the lag columns and the current value into one list per row
df_test['amounts'] = df_test[['past4', 'past3', 'past2', 'past1', 'val']].values.tolist()
df_test.drop(columns=['past4', 'past3', 'past2', 'past1'], inplace=True)
print(df_test)
    nr  val  future  amounts
 0   1   11      12  [0, 0, 0, 0, 11]
 1   1   12      13  [0, 0, 0, 11, 12]
 2   1   13      14  [0, 0, 11, 12, 13]
 3   1   14      15  [0, 11, 12, 13, 14]
 4   1   15       0  [11, 12, 13, 14, 15]
 5   6   61      62  [0, 0, 0, 0, 61]
 6   6   62      63  [0, 0, 0, 61, 62]
 7   6   63      64  [0, 0, 61, 62, 63]
 8   6   64      65  [0, 61, 62, 63, 64]
 9   6   65      66  [61, 62, 63, 64, 65]
10   6   66      67  [62, 63, 64, 65, 66]
11   6   67       0  [63, 64, 65, 66, 67]
So, given the following frame:
    nr  val
 0   1   11
 1   1   12
 2   1   13
 3   1   14
 4   1   15
 5   6   61
 6   6   62
 7   6   63
 8   6   64
 9   6   65
10   6   66
11   6   67
At the moment I have to group by "nr" as in the code above and build, for each row, a column that holds the previous 4 values of "val" in the group together with the current value. Similarly, an additional column holds, for each row, the next value of "val" in the group.
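I believe I should be able to build the list column called "amounts" more easily, possibly in one line. How can this be done?

Moving the block into a function makes the code more modular and lightweight. In this particular example we pass reversed(range(5)) as shift_values, which corresponds to the list [4, 3, 2, 1, 0].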
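import pandas as pd

data_test = {'nr': [1,1,1,1,1,6,6,6,6,6,6,6],
             'val': [11,12,13,14,15,61,62,63,64,65,66,67]}
df_test = pd.DataFrame(data_test, columns=['nr', 'val'])

def generate_past(df, shift_values):
    # one shifted copy of 'val' per shift value, stacked as rows of a frame
    serie = pd.DataFrame([df.groupby('nr')['val'].transform(lambda x: x.shift(shift_value).fillna(0))
                          for shift_value in shift_values])
    # transpose so that each inner list belongs to one original row
    return serie.T.values.tolist()

df_test['future'] = df_test.groupby(['nr'])['val'].transform(lambda x: x.shift(-1).fillna(0))
df_test['amounts'] = generate_past(df_test, reversed(range(5)))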
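As a possible shortening of the helper above, the per-group shifts can be taken directly from the GroupBy object, without transform and a lambda. The following is only a sketch under some assumptions: it relies on the fill_value argument of GroupBy.shift (so the values stay integers), the name n_past is made up for illustration, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "nr": [1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6],
    "val": [11, 12, 13, 14, 15, 61, 62, 63, 64, 65, 66, 67],
})

n_past = 4  # hypothetical name: how many previous rows to include
grouped = df.groupby("nr")["val"]

# next value inside the group, 0 where there is none
df["future"] = grouped.shift(-1, fill_value=0)

# shifts n_past, ..., 1, 0 side by side, then one list per row
df["amounts"] = pd.concat(
    [grouped.shift(i, fill_value=0) for i in range(n_past, -1, -1)], axis=1
).to_numpy().tolist()

print(df)
```

Whether this counts as "one line" is a matter of taste, but the four temporary pastN columns and the later drop are no longer needed.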
Use a custom function to create the nested lists, for example:
import numpy as np
import pandas as pd

def f(x):
    # list comprehension with shift by 4,3,2,1,0
    L = [x['val'].shift(i).fillna(0) for i in range(4, -1, -1)]
    # shifting to another column
    x['future'] = x['val'].shift(-1).fillna(0).astype(int)
    # column filled by lists
    x['amounts'] = pd.Series(np.array(L).astype(int).T.tolist(), index=x.index)
    return x

# group_keys=False keeps the original row index in the result
df_test = df_test.groupby(['nr'], group_keys=False).apply(f)
print(df_test)
    nr  val  future  amounts
 0   1   11      12  [0, 0, 0, 0, 11]
 1   1   12      13  [0, 0, 0, 11, 12]
 2   1   13      14  [0, 0, 11, 12, 13]
 3   1   14      15  [0, 11, 12, 13, 14]
 4   1   15       0  [11, 12, 13, 14, 15]
 5   6   61      62  [0, 0, 0, 0, 61]
 6   6   62      63  [0, 0, 0, 61, 62]
 7   6   63      64  [0, 0, 61, 62, 63]
 8   6   64      65  [0, 61, 62, 63, 64]
 9   6   65      66  [61, 62, 63, 64, 65]
10   6   66      67  [62, 63, 64, 65, 66]
11   6   67       0  [63, 64, 65, 66, 67]
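The same "previous 4 plus current" lists can also be read as trailing windows over each group, which suggests a different technique from the shift-based one above. The sketch below is not what the answer does: it pads each group with zeros and uses NumPy's sliding_window_view (available from NumPy 1.20). The helper name trailing_windows and the parameter n_past are invented for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({
    "nr": [1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6],
    "val": [11, 12, 13, 14, 15, 61, 62, 63, 64, 65, 66, 67],
})

def trailing_windows(values, n_past=4):
    # hypothetical helper: left-pad with n_past zeros, then take every
    # window of length n_past + 1, giving exactly one window per value
    padded = np.concatenate([np.zeros(n_past, dtype=values.dtype), values])
    return sliding_window_view(padded, n_past + 1).tolist()

# one Series of lists per group, keeping the group's original row labels
parts = [
    pd.Series(trailing_windows(s.to_numpy()), index=s.index)
    for _, s in df.groupby("nr")["val"]
]
# assignment aligns on the index, so the group order does not matter
df["amounts"] = pd.concat(parts)

print(df)
```

For a handful of lags the shift-based versions are probably easier to read; the window view mainly avoids building one shifted copy per lag when n_past is large.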
You can try it like this (same idea as jezrael's answer), but without apply. It is not a great approach, because I am building a new data frame.
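import pandas as pd

groups = []
for _, grp in df_test.groupby('nr'):
    grp = grp.reset_index(drop=True)
    grp['future'] = grp['val'].shift(-1).fillna(0).astype(int)
    # shifting by len(grp)-1, ..., 0 and keeping the last 5 values gives one window per row
    grp['amounts'] = pd.Series([grp['val'].shift(i).fillna(0).values[-5:]
                                for i in range(len(grp) - 1, -1, -1)])
    groups.append(grp)
# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
df_new = pd.concat(groups, ignore_index=True)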
df_new:
    nr  val  future  amounts
 0   1   11      12  [0.0, 0.0, 0.0, 0.0, 11.0]
 1   1   12      13  [0.0, 0.0, 0.0, 11.0, 12.0]
 2   1   13      14  [0.0, 0.0, 11.0, 12.0, 13.0]
 3   1   14      15  [0.0, 11.0, 12.0, 13.0, 14.0]
 4   1   15       0  [11, 12, 13, 14, 15]
 5   6   61      62  [0.0, 0.0, 0.0, 0.0, 61.0]
 6   6   62      63  [0.0, 0.0, 0.0, 61.0, 62.0]
 7   6   63      64  [0.0, 0.0, 61.0, 62.0, 63.0]
 8   6   64      65  [0.0, 61.0, 62.0, 63.0, 64.0]
 9   6   65      66  [61.0, 62.0, 63.0, 64.0, 65.0]
10   6   66      67  [62.0, 63.0, 64.0, 65.0, 66.0]
11   6   67       0  [63, 64, 65, 66, 67]
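The mix of floats and integers in this output appears to come from how shift handles missing values: any non-zero shift introduces NaN, which turns an integer column into floats before fillna(0) runs, while shift(0) introduces nothing and leaves the integers alone. That would explain why only the last row of each group shows integers. A small check on a toy Series (not taken from the answer) illustrates this:

```python
import pandas as pd

s = pd.Series([11, 12, 13])

shifted_filled = s.shift(1).fillna(0)      # NaN appears first, so the dtype becomes float
unshifted = s.shift(0)                     # nothing is missing, the dtype is unchanged
shifted_int = s.shift(1, fill_value=0)     # filling during the shift keeps integers

print(shifted_filled.dtype, unshifted.dtype, shifted_int.dtype)
```

If integer lists are wanted throughout, either pass fill_value=0 to shift or add .astype(int) after fillna(0).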
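Nice answer. I was thinking of using index.repeat and reindex(4) to create a new df and generate two data frames for each unique nr and value, but this is more concise and probably more memory-efficient too.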