Python: Removing outliers from a DataFrame using groupby

python pandas

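I have a DataFrame with report dates, time intervals, and total volumes covering a full year. I would like to remove the outliers within each time interval.

This is as far as I have been able to get: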

dft.head()

    Report Date Time Interval   Total Volume
5784    2016-03-01  24  467.0
5785    2016-03-01  25  580.0
5786    2016-03-01  26  716.0
5787    2016-03-01  27  803.0
5788    2016-03-01  28  941.0

So I compute the quantiles:

low = .05
high = .95
dfq = dft.groupby(['Time Interval']).quantile([low, high])
print(dfq.head())

                    Total Volume
Time Interval                   
24            0.05        420.15
              0.95        517.00
25            0.05        521.90
              0.95        653.55
26            0.05        662.75

Then I would like to use them to remove the outliers within each time interval, something like this:

dft = dft.apply(lambda x: x[(x>dfq.loc[low,x.name]) & (x < dfq.loc[high,x.name])], axis=0)


Any pointers or suggestions would be much appreciated.

One approach is to filter as follows:
In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)

In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4

Now we can use loc to look up these values for each row, and filter:
In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01    False
2016-03-01     True
2016-03-01     True
2016-03-01     True
2016-03-01    False
dtype: bool

In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
   Report        Date  Time  Interval  Total Volume
1    5785  2016-03-01    25     580.0           NaN
2    5786  2016-03-01    26     716.0           NaN
3    5787  2016-03-01    27     803.0           NaN


Note: grouping by "Time Interval" works the same way, but in your example it will not filter any rows.
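
The output above appears to come from a frame whose headers were split on whitespace, so the volumes ended up in a column named `Interval`. As a minimal sketch of the same lookup idea with the column names from the question, assuming a small hand-built frame (the sample rows are the five shown in the question, and `bounds`, `lower`, `upper` are illustrative names), one could write:

```python
import pandas as pd

# Sample rows copied from the question; column names as the asker wrote them.
dft = pd.DataFrame({
    "Report Date": ["2016-03-01"] * 5,
    "Time Interval": [24, 25, 26, 27, 28],
    "Total Volume": [467.0, 580.0, 716.0, 803.0, 941.0],
})

# One row per group, one column per quantile (columns are the floats 0.05 and 0.95).
bounds = (
    dft.groupby("Report Date")["Total Volume"]
    .quantile([0.05, 0.95])
    .unstack(level=1)
)

# Look up each row's group bounds; map() keeps the original row index,
# so the comparison below aligns row by row.
lower = dft["Report Date"].map(bounds[0.05])
upper = dft["Report Date"].map(bounds[0.95])

# Keep only rows strictly between the two quantiles of their own group.
filtered = dft[(dft["Total Volume"] > lower) & (dft["Total Volume"] < upper)]
print(filtered)
```

Using `map` instead of `res.loc[df.Date, ...]` avoids the `.values` juggling, because the looked-up bounds already share the index of `dft`.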

df[df.groupby("ReportDate").TotalVolume.\
      transform(lambda x : (x<x.quantile(0.95))&(x>(x.quantile(0.05)))).eq(1)]
Out[1033]: 
      ReportDate  TimeInterval  TotalVolume
5785  2016-03-01            25        580.0
5786  2016-03-01            26        716.0
5787  2016-03-01            27        803.0
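
To tie this back to the original goal of trimming within each time interval, here is a hedged sketch of the same `transform` technique grouped by `Time Interval` instead. The data is made up for illustration (two intervals observed on five days each), and the names `lower`, `upper`, `mask`, and `cleaned` are assumptions, not part of the answers above.

```python
import pandas as pd

# Hypothetical data: two time intervals, each observed on five report dates.
dft = pd.DataFrame({
    "Report Date": ["2016-03-01", "2016-03-02", "2016-03-03",
                    "2016-03-04", "2016-03-05"] * 2,
    "Time Interval": [24] * 5 + [25] * 5,
    "Total Volume": [100.0, 110.0, 120.0, 130.0, 1000.0,
                     200.0, 210.0, 220.0, 230.0, 5.0],
})

grouped = dft.groupby("Time Interval")["Total Volume"]

# transform() broadcasts each group's scalar quantile back to every row of that group.
lower = grouped.transform(lambda s: s.quantile(0.05))
upper = grouped.transform(lambda s: s.quantile(0.95))

# Boolean mask: True where the value lies strictly inside its group's bounds.
mask = (dft["Total Volume"] > lower) & (dft["Total Volume"] < upper)
cleaned = dft[mask]
print(cleaned)
```

Because the comparison is strict, the smallest and largest value of every group are always dropped, and a group with a single row keeps nothing. If that is too aggressive, switching to `>=` and `<=` or using a rule such as 1.5 times the interquartile range are possible alternatives.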