Python: rolling window correlation in pandas based on groupby and rolling


python pandas


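I want to compute a rolling correlation on grouped data. How can I do this in pandas? I have created dummy data and already done it with SQL in PySpark, as shown below.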

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 30 rows of random values for the numeric columns a, b, c
my_array = np.random.random(90).reshape(-1, 3)
# Group labels: 10 consecutive rows each of 'a', 'b', 'c'
groups = np.repeat(np.array(['a', 'b', 'c']), 10)
# Keep the labels out of the float array: np.append would cast every value to a string
df = pd.DataFrame(my_array, columns = list('abc'))
df['d'] = groups
df['date'] = pd.to_datetime([datetime.today() + timedelta(i) for i in range(30)])

spark.createDataFrame(df).createOrReplaceTempView('df_tbl')
spark.sql("""
   -- 8 preceding rows plus the current row: a frame of up to 9 rows
   select *, 
     corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor1,
     corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor2
   from df_tbl
  """).toPandas().head(10)

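Use date as the index and apply a rolling function on the groupby to compute corr on a and b. Afterwards, call reset_index to turn the index back into columns, because the timestamps are hard to access while they are the index. Like this: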

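# Index by date so the rolling window walks through the rows in date order
df.set_index('date', inplace=True)
# Pairwise rolling correlation of a and b within each group d, over 8 rows
result = df.groupby(['d'])[['a','b']].rolling(8).corr()
# Move d, date and the matrix row label (level_2) from the index into columns
result.reset_index(inplace=True)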

The output looks like this:

    d  date                        level_2  a          b
0   a  2020-03-03 21:21:29.512854  a        NaN        NaN
1   a  2020-03-03 21:21:29.512854  b        NaN        NaN
2   a  2020-03-04 21:21:29.512866  a        NaN        NaN
3   a  2020-03-04 21:21:29.512866  b        NaN        NaN
4   a  2020-03-05 21:21:29.512869  a        NaN        NaN
5   a  2020-03-05 21:21:29.512869  b        NaN        NaN
6   a  2020-03-06 21:21:29.512871  a        NaN        NaN
7   a  2020-03-06 21:21:29.512871  b        NaN        NaN
8   a  2020-03-07 21:21:29.512872  a        NaN        NaN
9   a  2020-03-07 21:21:29.512872  b        NaN        NaN
10  a  2020-03-08 21:21:29.512874  a        NaN        NaN
11  a  2020-03-08 21:21:29.512874  b        NaN        NaN
12  a  2020-03-09 21:21:29.512876  a        NaN        NaN
13  a  2020-03-09 21:21:29.512876  b        NaN        NaN
14  a  2020-03-10 21:21:29.512878  a        1.000000   -0.166854
15  a  2020-03-10 21:21:29.512878  b        -0.166854  1.000000
16  a  2020-03-11 21:21:29.512880  a        1.000000   -0.095549
17  a  2020-03-11 21:21:29.512880  b        -0.095549  1.000000
...
...
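The result above holds a 2x2 correlation matrix for every date, so each a-b correlation appears twice. If a single correlation column per row is closer to what the SQL query returns, one possible sketch is to correlate the two Series directly inside each group. The fixed seed, the fixed start date and the column name cor_ab are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Dummy data shaped like the question's: 30 rows, groups of 10 (seed is illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 3)), columns=list("abc"))
df["d"] = np.repeat(["a", "b", "c"], 10)
df["date"] = pd.date_range("2020-03-03", periods=30, freq="D")
df = df.sort_values(["d", "date"])

# Series-to-Series rolling correlation gives one value per row
# instead of a 2x2 matrix per row
parts = [g["a"].rolling(8).corr(g["b"]) for _, g in df.groupby("d")]

# Each part keeps the original row labels, so the assignment aligns by index
df["cor_ab"] = pd.concat(parts)  # cor_ab is a hypothetical column name

print(df[["d", "date", "cor_ab"]].head(10))
```

With a window of 8 and the default min_periods, the first seven rows of each group stay NaN, which matches the pattern in the output shown above.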

What is your expected output? Is it close to df.groupby('d').apply(lambda x: x.rolling(8, min_periods=1).corr())? Is your window equal to

?
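On the window question, the SQL frame in the question (rows between 8 preceding and current row) spans up to 9 rows and yields values before the frame is full, while rolling(8) waits for 8 complete rows. A minimal sketch that mimics that frame in pandas, assuming this is the behaviour wanted, uses a window of 9 with a lowered min_periods. The seed, start date and the value min_periods=2 are choices made for this illustration, and how Spark reports a one-row frame is not reproduced here.

```python
import numpy as np
import pandas as pd

# Dummy data shaped like the question's (seed is illustrative)
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((30, 2)), columns=["a", "b"])
df["d"] = np.repeat(["a", "b", "c"], 10)
df["date"] = pd.date_range("2020-03-03", periods=30, freq="D")
df = df.sort_values(["d", "date"])

# 8 preceding rows + current row = 9 rows; min_periods=2 because a
# correlation needs at least two observations to be defined
parts = [
    g["a"].rolling(window=9, min_periods=2).corr(g["b"])
    for _, g in df.groupby("d")
]
df["cor1"] = pd.concat(parts)

print(df[["d", "date", "cor1"]].head(10))
```

Only the first row of each group is NaN here, since every later row already has at least two observations in its window.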