Python: how to sum every value of the table produced at each index by dataframe.rolling.sum()
I am working with a large data table in which I am trying to correlate all of the columns.
I use:
df = df.rolling(5).corr(pairwise = True)
This produces data like the following:
477
s1 -0.240339 0.932141 1.000000 0.577741 0.718307 -0.518748 0.772099
s2 0.534848 0.626280 0.577741 1.000000 0.645064 -0.455503 0.447589
s3 0.384720 0.907782 0.718307 0.645064 1.000000 -0.831378 0.406054
s4 -0.347547 -0.651557 -0.518748 -0.455503 -0.831378 1.000000 -0.569301
s5 -0.315022 0.576705 0.772099 0.447589 0.406054 -0.569301 1.000000
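One such table is produced for each row in the data set. In this example, 477 is the row number (index) and s1-s5 are the column headers.
The goal is to find high correlations between sensors. I want to do this by (a) computing the correlation over a 5-row rolling window using the code above, and (b) for each row generated, i.e. for I = 0 to I = 500 for a 500-row Excel sheet, summing the table that dataframe.rolling(5).corr() produces for each value of I. That gives one value per unit of time, as in the plot included at the bottom. I am new to Stack Overflow, so please let me know if I can provide more information.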
Sample code + data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
dfn = df.rolling(5).corr(pairwise = True)
MATLAB code that does what I want:
% move through the data and get a correlation for 5 data points
for i=1:ns-4
    C(:,:,i)=corrcoef(X(i:i+4,:));
    cact(i)=sum(C(:,:,i),'all')-nv; % subtracting nv removes the diagonals that are = 1 and don't change
end
For the raw data, this is the plot I am trying to produce in Python, where the x-axis is time:
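Sum the entire table in both directions and subtract the diagonal of ones, i.e. each sensor's correlation with itself.
Using your dfn, the table for index 4 is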
>>> dfn.loc[4]
col1 col2 col3
col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
You can sum the whole table by calling NumPy's ndarray.sum() on the underlying data
>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103
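Then, assuming the correlation table is square, you only need to subtract the number of columns/sensors. If you don't already have a variable for that, you can use the shape of the underlying NumPy array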
>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029
>>>
Without using a NumPy array, you can sum the correlation table twice before subtracting
>>> four = dfn.loc[4]
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029
Get the NumPy array of the whole rolling correlation and reshape it so that there is a separate correlation table for each original row
n_sensors = 3
v = dfn.to_numpy() # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])
[[ 1. -0.14697697 -0.22705934]
[-0.14697697 1. 0.43521648]
[-0.22705934 0.43521648 1. ]]
Sum over the last two dimensions and subtract the number of sensors
result = v.sum((1,2)) - n_sensors
print(result)
[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]
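The reshape approach above hard-codes n_sensors. Below is a sketch that derives the dimensions from df and cross-checks the result against an explicit window loop modelled on the MATLAB code in the question. The name cact is borrowed from that MATLAB snippet; the variable window and the looped check are my additions for illustration, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

d = {
    "col1": [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    "col2": [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    "col3": [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}
df = pd.DataFrame(data=d)
window = 5
n_rows, n_sensors = df.shape

dfn = df.rolling(window).corr(pairwise=True)

# One (n_sensors x n_sensors) correlation table per original row.
v = dfn.to_numpy().reshape(n_rows, n_sensors, n_sensors)
# Sum each table and remove the diagonal of ones; incomplete windows stay NaN.
result = v.sum(axis=(1, 2)) - n_sensors

# Cross-check with an explicit loop over 5-row windows, as in the MATLAB code.
cact = np.full(n_rows, np.nan)
for i in range(window - 1, n_rows):
    block = df.iloc[i - window + 1 : i + 1].to_numpy()
    cact[i] = np.corrcoef(block, rowvar=False).sum() - n_sensors

print(result)
print(np.allclose(result, cact, equal_nan=True))
```

The loop labels each window by its last row, which matches how rolling() indexes its output; the MATLAB version labels each window by its first row instead.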
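Maybe there is a way to do this in Pandas, but I would have to study it to find out. Perhaps someone will answer with an all-Pandas solution.
The rolling correlation DataFrame has a MultiIndex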
>>> dfn.index
MultiIndex([(0, 'col1'),
(0, 'col2'),
(0, 'col3'),
(1, 'col1'),
(1, 'col2'),
(1, 'col3'),
(2, 'col1'),
(2, 'col2'),
(2, 'col3'),
...
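From a quick look, and a search for "pandas multiindex sum on level 0 site:stackoverflow.com", I found this: group by level 0 and sum, then sum again along the columns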
>>> four_five = dfn.loc[[4,5]]
>>> four_five
col1 col2 col3
4 col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
5 col1 1.000000 0.191238 -0.644203
col2 0.191238 1.000000 0.579545
col3 -0.644203 0.579545 1.000000
>>> four_five.groupby(level=0).sum()
col1 col2 col3
4 0.625964 1.288240 1.208157
5 0.547035 1.770783 0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4 3.12236
5 3.25316
dtype: float64
>>>
Then for the full DataFrame
>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0 -3.000000
1 -3.000000
2 -3.000000
3 -3.000000
4 0.122360
5 0.253160
6 -2.407632
7 -1.937020
8 -2.280236
9 -2.578865
dtype: float64
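Note that in the output above the first four indices come out as -3.0 rather than NaN. Pandas sums skip NaN by default, so an all-NaN table sums to 0 before n_sensors is subtracted. Here is a sketch that keeps the incomplete windows as NaN, using the sample data from the question. min_count and skipna are existing pandas parameters; the variable names per_index and cact are only illustrative.

```python
import pandas as pd

d = {
    "col1": [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
             -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
    "col2": [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
             -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
    "col3": [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
             -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308],
}
df = pd.DataFrame(data=d)
n_sensors = df.shape[1]
dfn = df.rolling(5).corr(pairwise=True)

# min_count=1 makes an all-NaN group sum to NaN instead of 0;
# skipna=False then propagates that NaN through the row-wise sum.
per_index = dfn.groupby(level=0).sum(min_count=1).sum(axis=1, skipna=False)
cact = per_index - n_sensors
print(cact)
```

With this variant, indices 0 to 3 are NaN, which matches the NumPy reshape result and is easier to leave out of a plot than a spurious -3.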
From reading more of the answers in that search (I should have looked at the documentation more closely)
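Welcome to Stack Overflow! Please include a small subset of your data as copy-pasteable code for testing, along with the expected output for the data provided. A few questions about this: (1) What does "large data table" mean to you in this case? (2) When .rolling() is applied this way, the correlation is in theory only computed on 5 rows at a time. Is that intended? (If you want to do this column-wise, you need the argument axis=1.) (3) Is the DataFrame nested? (4) Since pandas.DataFrame.corr() should already do the pairwise operation across the whole DataFrame, what do you want to achieve with this output?
"Sum the correlation table for each integer index": can you explain that? Do you want to perform that operation for each individual index, like 477? You may want to include a minimal example of the original DataFrame and the expected result of the operation.
Large DataFrame --> an Excel sheet with 55 columns and about 500 rows. I was wondering how to post a snippet of the data I am working with in my question. I will edit my question to reflect your comments, thanks!
You don't need all 55 columns or 500 rows. Just enough to illustrate what you want to do. It can even be fake data with random values.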