Python: Filter a DataFrame by the mean of the last N values

I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean of all rows in the filtered set.

_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)

Something like this:

_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]

However, the problem here is that the lengths of the values don't match. Any suggestions?

ValueError: Series lengths must match to compare

Sample data

This has an index on year and month, plus two columns:

            Col1    Col2
year    month       
2005    12  0.533835    0.170679
        12  0.494733    0.198347
2006    3   0.440098    0.202240
        6   0.410285    0.188421
        9   0.502420    0.200188
        12  0.522253    0.118680
2007    3   0.378120    0.171192
        6   0.431989    0.145158
        9   0.612036    0.178097
        12  0.519766    0.252196
2008    3   0.547705    0.202163
        6   0.560985    0.238591
        9   0.617320    0.199537
        12  0.343939    0.253855

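Why not just use a boolean mask on the filtered DataFrame? The comparison yields one boolean per column, so pass it to .loc as the column selector:

df.loc[:, df.tail(3).mean() > df.mean()]

Demo

>>> df
   0  1  2  3  4
0  4  8  2  4  6
1  0  0  0  2  8
2  5  3  0  9  3
3  7  5  5  1  2
4  9  7  8  9  4

>>> df.loc[:, df.tail(3).mean() > df.mean()]
   0  1  2  3
0  4  8  2  4
1  0  0  0  2
2  5  3  0  9
3  7  5  5  1
4  9  7  8  9

Column 4 is dropped because the mean of its last 3 values (3.0) is below its overall mean (4.6).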
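
As a self-contained sketch of the demo (the names `mask` and `growing` are illustrative, not from the thread): the comparison produces a boolean Series labelled by column, and `.loc[:, mask]` uses it to select columns. Passing the same Series to plain `df[...]` would instead be treated as a row filter aligned on the row index, which is probably not what is wanted here.

```python
import pandas as pd

# Same values as the demo frame in this answer.
df = pd.DataFrame(
    [
        [4, 8, 2, 4, 6],
        [0, 0, 0, 2, 8],
        [5, 3, 0, 9, 3],
        [7, 5, 5, 1, 2],
        [9, 7, 8, 9, 4],
    ]
)

# One boolean per column: does the mean of the last 3 rows
# exceed that column's overall mean?
mask = df.tail(3).mean() > df.mean()

# Keep every row, but only the columns where the mask is True.
growing = df.loc[:, mask]

print(mask)
print(growing)
```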

Updated example for the MultiIndex edit

The same approach works with your MultiIndex sample; the mask simply selects columns through .loc:

>>> df 
             col1      col2
2005 12 -0.340088 -0.574140
     12 -0.814014  0.430580
2006 3   0.464008  0.438494
     6   0.019508 -0.635128
     9   0.622645 -0.824526
     12 -1.674920 -1.027275
2007 3   0.397133  0.659467
     6   0.026170 -0.052063
     9   0.835561  0.608067
     12  0.736873 -0.613877
2008 3   0.344781 -0.566392
     6  -0.653290 -0.264992
     9   0.080592 -0.548189
     12  0.585642  1.149779

>>> df.loc[:,df.tail(3).mean() > df.mean()] 
             col2
2005 12 -0.574140
     12  0.430580
2006 3   0.438494
     6  -0.635128
     9  -0.824526
     12 -1.027275
2007 3   0.659467
     6  -0.052063
     9   0.608067
     12 -0.613877
2008 3  -0.566392
     6  -0.264992
     9  -0.548189
     12  1.149779
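
The title asks about the last N values in general, so here is a minimal sketch that wraps the mask in a function with N as a parameter. The helper name `columns_with_rising_tail` is hypothetical, and the frame reuses only the last six rows of the question's sample data. Restricting both means to the same numeric columns is one way to guarantee that the two Series carry identical labels before they are compared; whether mismatched labels caused the ValueError in the question is an assumption, not something the thread confirms.

```python
import pandas as pd


def columns_with_rising_tail(frame: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Keep columns whose mean over the last n rows exceeds their overall mean."""
    # Compute both means on the same numeric columns so the labels match.
    numeric = frame.select_dtypes(include="number")
    mask = numeric.tail(n).mean() > numeric.mean()
    return numeric.loc[:, mask]


# Last six rows of the question's sample data, indexed by (year, month).
index = pd.MultiIndex.from_tuples(
    [(2007, 9), (2007, 12), (2008, 3), (2008, 6), (2008, 9), (2008, 12)],
    names=["year", "month"],
)
sample = pd.DataFrame(
    {
        "Col1": [0.612036, 0.519766, 0.547705, 0.560985, 0.617320, 0.343939],
        "Col2": [0.178097, 0.252196, 0.202163, 0.238591, 0.199537, 0.253855],
    },
    index=index,
)

result = columns_with_rising_tail(sample, n=3)
print(result)
```

The row MultiIndex plays no part in the mask, because the mask is labelled by column; that is why the same expression works for both a flat index and a MultiIndex.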

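Do you mean the last 3 rows of the DataFrame, or the preceding 3 rows (i.e., if I'm at row 5, it should be the mean of rows 3, 4, and 5)?

Yes, so you'd have df = [1, 2, 3, 4, 5, 6, 7] and you want to see whether the mean of the last 3 values is greater than the mean of all values in the array (which makes sense in a time series :))

Similar, but it now gives a different error: --> 95 _filtered_growing = _filtered_d_all[_last_n_records > _filtered_d_all.mean()] ValueError: Series lengths must match to compare

@Eamonn You need to compare the two means, as I did in my example.

Yes, I did that first, but the same error occurs. --> 96 _filtered_growing = _filtered_d_all[_filtered_d.tail(3).mean() > _filtered_d_all.mean()] ValueError: Series lengths must match to compare

@Eamonn The same concept applies to a MultiIndex; see my edit.

Thanks @mitch! I got it working with this call: _filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.tail(3).mean() > _filtered_d.mean()].loc[:, _filtered_d.mean() > 0.05], and the result is as expected.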
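
The two conditions in that final call can also be combined into a single mask with `&`. The sketch below is one possible way to write it, not the asker's code: the frame `data` and its values are made up for illustration, and both masks are computed from the same column subset so their labels line up.

```python
import pandas as pd

# Made-up data: "a" rises, "b" falls, "c" rises but stays below the threshold.
data = pd.DataFrame(
    {
        "a": [0.10, 0.20, 0.30, 0.40, 0.50],
        "b": [0.50, 0.40, 0.30, 0.20, 0.10],
        "c": [0.01, 0.01, 0.02, 0.03, 0.03],
    }
)

# At most the first 50 columns, as in the thread.
subset = data.iloc[:, 0:50]

overall = subset.mean()
rising = subset.tail(3).mean() > overall   # last-3 mean above overall mean
above = overall > 0.05                     # overall mean above the threshold

# Both masks share the subset's column labels, so they combine element-wise.
selected = subset.loc[:, rising & above]
print(selected)
```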