Python: How to sum each table of values at each index produced by dataframe.rolling.sum()

python pandas dataframe

I am working with a large data table in which I am trying to correlate all of the columns.

I use:

df = df.rolling(5).corr(pairwise = True)


This generates data like the following:

477 

s1  -0.240339   0.932141    1.000000    0.577741    0.718307    -0.518748   0.772099 
s2  0.534848    0.626280    0.577741    1.000000    0.645064    -0.455503   0.447589 
s3  0.384720    0.907782    0.718307    0.645064    1.000000    -0.831378   0.406054
s4  -0.347547   -0.651557   -0.518748   -0.455503   -0.831378   1.000000    -0.569301 
s5  -0.315022   0.576705    0.772099    0.447589    0.406054    -0.569301   1.000000


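One such table is generated for every row in the dataset. In this example 477 is the row number (index) and s1-s5 are the column headers.

The goal is to find when the sensors are highly correlated. I hope to do this by (a) computing the correlation over a rolling window of 5 rows with the code above, and (b) summing the table that dataframe.rolling(5).corr() generates for each row, i.e. for i = 0 to i = 500 for a 500-row Excel sheet, so that there is one value per unit of time, as in the graph included at the bottom. I'm new to Stack Overflow, so let me know if I can provide more information.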


Sample code + data:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)

dfn = df.rolling(5).corr(pairwise = True)

MATLAB code that does what I want:

% move through the data and get a correlation for 5 data points
for i=1:ns-4
    C(:,:,i)=corrcoef(X(i:i+4,:));
    cact(i)=sum(C(:,:,i),'all')-nv; % subtracting nv removes the diagonals that are = 1 and don't change
end

For the raw data, below is the graph I am trying to produce in Python, where the x-axis is time:

Sum the whole table in both directions and subtract the diagonal of 1s, i.e. each sensor's correlation with itself.
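The MATLAB loop in the question can be mirrored almost line for line with NumPy. The following is a minimal sketch, not code from the question or the answer: the function name windowed_corr_sums and the small demo array are hypothetical, and it assumes X is a 2-D array with one row per time step and one column per sensor.

```python
import numpy as np


def windowed_corr_sums(X, window=5):
    """Sum of off-diagonal correlations for each run of `window` rows.

    Hypothetical helper: X has one row per time step, one column per sensor.
    """
    X = np.asarray(X, dtype=float)
    ns, nv = X.shape
    if ns < window:
        raise ValueError("need at least `window` rows")
    cact = np.empty(ns - window + 1)
    for i in range(ns - window + 1):
        # rowvar=False treats columns as variables, like MATLAB's corrcoef
        C = np.corrcoef(X[i:i + window], rowvar=False)
        # subtracting nv removes the diagonal of 1s
        cact[i] = C.sum() - nv
    return cact


# Demo with made-up, perfectly (anti)correlated columns:
# every window should give a value of about -2.
t = np.arange(8, dtype=float)
demo = np.column_stack([t, 2 * t, -t])
print(windowed_corr_sums(demo))
```

Note that entry i of the returned array belongs to the window starting at row i, whereas pandas labels a rolling window by its last row, so the first value here lines up with index 4 of a rolling(5) result.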

Using your dfn, the correlation table at index 4 is

>>> dfn.loc[4]   
          col1      col2      col3
col1  1.000000 -0.146977 -0.227059
col2 -0.146977  1.000000  0.435216
col3 -0.227059  0.435216  1.000000

You can sum the whole table by calling NumPy's ndarray.sum() on the underlying data

>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103

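Then, assuming the correlation table is square, you only need to subtract the number of columns/sensors. If you don't already have that in a variable, you can use the shape of the underlying numpy array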

>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029
>>>

Without using a numpy array, you can sum the correlation table twice before subtracting

>>> four = dfn.loc[4] 
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029

Get the numpy array of the whole rolling correlation and reshape it so that there is a separate correlation table for each original row

n_sensors = 3
v = dfn.to_numpy()  # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])

 [[ 1.         -0.14697697 -0.22705934]
 [-0.14697697  1.          0.43521648]
 [-0.22705934  0.43521648  1.        ]]

Sum over the last two dimensions and subtract the number of sensors

result = v.sum((1,2)) - n_sensors
print(result)

[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]
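The reshape-and-sum steps above can be wrapped in a small function that reads the number of sensors from the DataFrame instead of hard-coding it. This is only a sketch under stated assumptions: the function name rolling_corr_offdiag_sum is hypothetical, the demo data is random, and it relies on rolling(...).corr(pairwise=True) returning one full block of n_sensors rows per original row (including the leading all-NaN blocks), as in the output shown in this answer.

```python
import numpy as np
import pandas as pd


def rolling_corr_offdiag_sum(df, window):
    """One value per row of `df`: sum of the rolling correlation table
    minus the diagonal. Hypothetical helper name."""
    n_rows, n_sensors = df.shape
    corr = df.rolling(window).corr(pairwise=True)
    # (n_rows * n_sensors, n_sensors) -> (n_rows, n_sensors, n_sensors)
    tables = corr.to_numpy().reshape(n_rows, n_sensors, n_sensors)
    totals = tables.sum(axis=(1, 2)) - n_sensors
    # Reuse the original index so the result can be plotted against time
    return pd.Series(totals, index=df.index, name="corr_sum")


# Demo on made-up random data (12 time steps, 4 sensors)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(12, 4)), columns=list("abcd"))
result = rolling_corr_offdiag_sum(demo, window=5)
print(result)
```

Because the result is a Series on the original index, the first window - 1 entries stay NaN, and result.plot() would give one point per time step.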

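Maybe there is a way to do this in pandas, but I would have to research it to find out. Perhaps someone will answer with an all-pandas solution.

The rolling correlation DataFrame has a MultiIndex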

>>> dfn.index
MultiIndex([(0, 'col1'),
            (0, 'col2'),
            (0, 'col3'),
            (1, 'col1'),
            (1, 'col2'),
            (1, 'col3'),
            (2, 'col1'),
            (2, 'col2'),
            (2, 'col3'),
            ...

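After a quick look, and a search of stackoverflow.com for "pandas multiindex sum on level 0", I found this: group by level 0 and sum, then sum again along the columns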

>>> four_five = dfn.loc[[4,5]]
>>> four_five
            col1      col2      col3
4 col1  1.000000 -0.146977 -0.227059
  col2 -0.146977  1.000000  0.435216
  col3 -0.227059  0.435216  1.000000
5 col1  1.000000  0.191238 -0.644203
  col2  0.191238  1.000000  0.579545
  col3 -0.644203  0.579545  1.000000
>>> four_five.groupby(level=0).sum()
       col1      col2      col3
4  0.625964  1.288240  1.208157
5  0.547035  1.770783  0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4    3.12236
5    3.25316
dtype: float64
>>>

Then for the whole DataFrame

>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0   -3.000000
1   -3.000000
2   -3.000000
3   -3.000000
4    0.122360
5    0.253160
6   -2.407632
7   -1.937020
8   -2.280236
9   -2.578865
dtype: float64
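In the output above, the first four rows come out as -3.0 rather than NaN, because pandas sums skip NaN by default and an all-NaN table therefore sums to 0. If those incomplete windows should stay NaN, as in the numpy version, one option is the min_count argument of sum. A minimal sketch on made-up random data (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: 8 time steps, 3 sensors
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(8, 3)), columns=["col1", "col2", "col3"])
n_sensors = df.shape[1]

dfn = df.rolling(5).corr(pairwise=True)

# Default behaviour: all-NaN tables sum to 0, so the result is -n_sensors
default_totals = dfn.groupby(level=0).sum().sum(axis=1) - n_sensors

# min_count=1 requires at least one non-NaN value, otherwise the sum is NaN
per_row = dfn.groupby(level=0).sum(min_count=1)
totals = per_row.sum(axis=1, min_count=1) - n_sensors

print(default_totals)
print(totals)
```

With min_count=1 a table that is only partly NaN is still summed over its valid entries; passing skipna=False to the second sum is a stricter alternative if any NaN should invalidate the whole row.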

Reading more of the answers from that search (I should take a closer look at the documentation)

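Welcome to Stack Overflow! Please include a small subset of your data as copyable code for testing, along with the expected output for the data provided. A few questions about this: (1) What does "large data table" mean to you in this case? (2) When .rolling() is applied this way, the correlation is in theory computed on only 5 rows at a time. Is that intended? (If you want to do this column-wise, you need the argument axis=1.) (3) Is the DataFrame nested? (4) Since pandas.DataFrame.corr() should already perform the pairwise operation over the whole DataFrame, what are you trying to achieve with this output?

"Sum the correlation table for each integer index" - can you explain that? Do you want to do that for each individual index, such as 477? You may want to include a minimal example of the original DataFrame and the expected result of the operation.

Large DataFrame --> an Excel sheet with 55 columns and about 500 rows. I am figuring out how to post a slice of the data I'm working on into my question. I will edit my question to reflect your comments, thanks!

You don't need all 55 columns or 500 rows, just enough to illustrate what you are trying to do. It can even be fake data with random values.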