Python: How to sum each table of values at each index produced by dataframe.rolling.corr()


I am working with a large data table in which I am trying to correlate all of the columns.

I use:

df = df.rolling(5).corr(pairwise = True)
This generates data like the following:

477 

s1  -0.240339   0.932141    1.000000    0.577741    0.718307    -0.518748   0.772099 
s2  0.534848    0.626280    0.577741    1.000000    0.645064    -0.455503   0.447589 
s3  0.384720    0.907782    0.718307    0.645064    1.000000    -0.831378   0.406054
s4  -0.347547   -0.651557   -0.518748   -0.455503   -0.831378   1.000000    -0.569301 
s5  -0.315022   0.576705    0.772099    0.447589    0.406054    -0.569301   1.000000 
One such table is generated for each row in the data set. Here, 477 is the row number (index) and s1-s5 are the column headers.

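The goal is to find high correlation between sensors. I want to do this by (a) computing the correlation over a 5-row rolling window using the code above, and (b) for each row generated, i.e. for i = 0 to i = 500 for a 500-row Excel sheet, summing the table that dataframe.rolling(5).corr() produces for that value of i. This yields one value per unit of time, as shown in the plot included at the bottom. I am new to Stack Overflow, so please let me know if I can provide more information.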

Sample code + data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {
    'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}

df = pd.DataFrame(data=d)

dfn = df.rolling(5).corr(pairwise = True)
MATLAB code that does what I want:

% move through the data and get a correlation for 5 data points
for i = 1:ns-4
    C(:,:,i) = corrcoef(X(i:i+4,:));
    cact(i) = sum(C(:,:,i), 'all') - nv; % subtracting nv removes the diagonals, which are = 1 and don't change
end
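
For comparison, a rough NumPy transliteration of the MATLAB loop above might look like the following. It is only a sketch: it uses the three-column sample data from this question, the names `X`, `ns`, `nv`, and `cact` simply mirror the MATLAB variables, and `window` is a name introduced here for illustration. Note that `np.corrcoef` treats rows as variables by default, so `rowvar=False` is needed to match MATLAB's `corrcoef`.

```python
import numpy as np

# Sample data from the question, one list per sensor/column.
col1 = [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
        -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231]
col2 = [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
        -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462]
col3 = [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
        -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]

X = np.column_stack([col1, col2, col3])  # rows = samples, columns = sensors
ns, nv = X.shape
window = 5  # illustrative name; the MATLAB code hard-codes 5 points

cact = np.empty(ns - window + 1)
for i in range(ns - window + 1):
    # rowvar=False: each column is a variable, as in MATLAB's corrcoef
    C = np.corrcoef(X[i:i + window], rowvar=False)
    # subtract nv to drop the diagonal of ones
    cact[i] = C.sum() - nv

print(cact)
```

Here `cact[i]` covers rows `i` to `i + 4`, so it corresponds to index `i + 4` in the output of `df.rolling(5).corr(pairwise=True)`.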
For the raw data, below is the plot I am trying to produce in Python, where the x-axis is time:

Sum the entire table in both directions and subtract the diagonal of 1s, i.e. each sensor's correlation with itself.

Using your dfn, the table at index 4 is:

>>> dfn.loc[4]   
          col1      col2      col3
col1  1.000000 -0.146977 -0.227059
col2 -0.146977  1.000000  0.435216
col3 -0.227059  0.435216  1.000000
You can sum the whole table by calling NumPy's ndarray.sum() on the underlying data:

>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103
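Then, assuming the correlation table is square, you only need to subtract the number of columns/sensors. If you don't already have a variable for that, you can use the shape of the underlying NumPy array: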

>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029
>>>
Without using a NumPy array, you can sum the correlation table twice before subtracting:

>>> four = dfn.loc[4] 
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029
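
As a small variation on the subtraction above, the diagonal can also be removed by subtracting the trace instead of the column count, which does not rely on every diagonal entry being exactly 1. This is a minimal sketch using the sample data from the question; `offdiag_sum` is a hypothetical helper name, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
d = {
    'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}
df = pd.DataFrame(data=d)
dfn = df.rolling(5).corr(pairwise=True)


def offdiag_sum(table):
    """Sum every entry of a square correlation table except its diagonal."""
    v = table.to_numpy()
    return v.sum() - np.trace(v)


value_at_4 = offdiag_sum(dfn.loc[4])
print(value_at_4)
```

For the table at index 4 this should agree, up to floating-point rounding, with the 0.12236 obtained above.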

Get the NumPy array of the entire rolling correlation and reshape it so that there is a separate correlation matrix for each original row:

n_sensors = 3
v = dfn.to_numpy()  # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])

 [[ 1.         -0.14697697 -0.22705934]
 [-0.14697697  1.          0.43521648]
 [-0.22705934  0.43521648  1.        ]]
Sum over the last two dimensions and subtract the number of sensors:

result = v.sum((1,2)) - n_sensors
print(result)

[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]
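
The reshape-and-sum steps above can be bundled into one function that reads the number of sensors from the DataFrame instead of hard-coding it, and returns a Series aligned with the original index (convenient for plotting against time). This is a sketch under the assumption that the rolling correlation output keeps one square block per original row, as it does for the sample data; `rolling_offdiag_corr_sum` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
d = {
    'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}
df = pd.DataFrame(data=d)


def rolling_offdiag_corr_sum(frame, window):
    """Per-row sum of the rolling correlation matrix, excluding the diagonal."""
    n_rows, n_sensors = frame.shape
    corr = frame.rolling(window).corr(pairwise=True)
    # One (n_sensors x n_sensors) matrix per original row.
    cube = corr.to_numpy().reshape(n_rows, n_sensors, n_sensors)
    # Sum each matrix, then remove the diagonal of ones.
    totals = cube.sum(axis=(1, 2)) - n_sensors
    return pd.Series(totals, index=frame.index, name="corr_sum")


result = rolling_offdiag_corr_sum(df, 5)
print(result)
```

The first `window - 1` entries are NaN because those windows are incomplete; calling `result.plot()` would then give one value per time step.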
There may be a way to do this in pandas alone, but I would have to research it to find out. Perhaps someone else will answer with an all-pandas solution.


The rolling correlation DataFrame has a MultiIndex:

>>> dfn.index
MultiIndex([(0, 'col1'),
            (0, 'col2'),
            (0, 'col3'),
            (1, 'col1'),
            (1, 'col2'),
            (1, 'col3'),
            (2, 'col1'),
            (2, 'col2'),
            (2, 'col3'),
            ...
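After a quick look, and a search for "pandas multiindex sum on level 0 site:stackoverflow.com", I found the following approach: group by level 0 and sum, then sum again along the columns.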

>>> four_five = dfn.loc[[4,5]]
>>> four_five
            col1      col2      col3
4 col1  1.000000 -0.146977 -0.227059
  col2 -0.146977  1.000000  0.435216
  col3 -0.227059  0.435216  1.000000
5 col1  1.000000  0.191238 -0.644203
  col2  0.191238  1.000000  0.579545
  col3 -0.644203  0.579545  1.000000
>>> four_five.groupby(level=0).sum()
       col1      col2      col3
4  0.625964  1.288240  1.208157
5  0.547035  1.770783  0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4    3.12236
5    3.25316
dtype: float64
>>>
Then for the full DataFrame:

>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0   -3.000000
1   -3.000000
2   -3.000000
3   -3.000000
4    0.122360
5    0.253160
6   -2.407632
7   -1.937020
8   -2.280236
9   -2.578865
dtype: float64 
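
Note that indexes 0 to 3 show -3.0 rather than NaN: pandas sums skip NaN by default, so an all-NaN table sums to 0 before the subtraction. If NaN is preferred for incomplete windows (matching the NumPy result), one option is the `min_count` argument of both `sum` calls. This is a sketch using the question's sample data; `totals` is a name introduced here for illustration.

```python
import pandas as pd

# Sample data from the question.
d = {
    'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}
df = pd.DataFrame(data=d)
dfn = df.rolling(5).corr(pairwise=True)
n_sensors = df.shape[1]

# min_count=1 makes a sum over all-NaN values return NaN instead of 0.
totals = (
    dfn.groupby(level=0).sum(min_count=1)  # column sums per original row
       .sum(axis=1, min_count=1)           # then sum across the columns
    - n_sensors                            # remove the diagonal of ones
)
print(totals)
```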
After reading more of the answers from that search (I should have looked at the documentation more carefully)


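Welcome to Stack Overflow! Please include a small subset of your data as copy-pasteable code for testing, along with the expected output for the data provided. A few questions about this: (1) What does "large data table" mean for you in this case? (2) When .rolling() is applied this way, the correlation is, in theory, computed on only 5 rows at a time. Is that intended? (If you want to do this column-wise, you need the argument axis=1.) (3) Is the DataFrame nested? (4) Since pandas.DataFrame.corr() should already perform the pairwise operation across the whole DataFrame, what do you want to achieve with this output?

"Sum the correlation table at each integer index": can you explain that? Do you want to do this for each individual index, such as 477? You may want to include a minimal example of the original DataFrame and the expected result of the operation.

Large DataFrame --> an Excel sheet with 55 columns and about 500 rows. I am trying to figure out how to post a snippet of the data I am working with in my question. I will edit my question to reflect your comments, thank you!

You don't need all 55 columns or 500 rows, just enough to illustrate what you are trying to do. It can even be fake data with random values.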
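
As an illustration of the fake-data suggestion, a small reproducible frame could be built roughly as follows. This is only a sketch: the shape (20 rows, 5 sensors), the seed, and the names `fake` and `fake_corr` are arbitrary choices made here, not part of the original question.

```python
import numpy as np
import pandas as pd

# Seeded generator so that others can reproduce the same values.
rng = np.random.default_rng(0)

# 20 rows of random readings for five hypothetical sensors s1..s5.
fake = pd.DataFrame(
    rng.normal(size=(20, 5)),
    columns=[f"s{k}" for k in range(1, 6)],
)

# Same rolling correlation as in the question, on the fake data.
fake_corr = fake.rolling(5).corr(pairwise=True)
print(fake_corr.loc[4])
```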