Python: filtering outliers before a groupby

python pandas numpy

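I have a DataFrame with a price column (P) that contains some unwanted values, such as 0, 1.50, 92.80 and 0.80. Before I compute the average price per product code, I want to remove these outliers.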

                Code    Year    Month  Day   Q      P
0               100     2017       1    4   2.0  42.90
1               100     2017       1    9   2.0  42.90
2               100     2017       1   18   1.0  45.05
3               100     2017       1   19   2.0  45.05
4               100     2017       1   20   1.0  45.05
5               100     2017       1   24  10.0  46.40
6               100     2017       1   26   1.0  46.40
7               100     2017       1   28   2.0  92.80
8               100     2017       2    1   0.0   0.00
9               100     2017       2    7   2.0   1.50
10              100     2017       2    8   5.0   0.80
11              100     2017       2    9   1.0  45.05
12              100     2017       2   11   1.0   1.50
13              100     2017       3    8   1.0  49.90
14              100     2017       3   17   6.0  45.05
15              100     2017       3   24   1.0  45.05
16              100     2017       3   30   2.0   1.50

How can I filter out the outliers for each product (grouped by Code)?

I tried this:

stds = 1.0  # Number of standard deviation that defines 'outlier'.
z = df[['Code','P']].groupby('Code').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]

Then:

print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())

But the outlier filter does not work as expected.

IIUC, you can use a groupby on Code, compute a z-score on P within each group, and filter out the rows whose z-score exceeds your threshold:

stds = 1.0  # number of standard deviations that defines an outlier
filtered_df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x - x.mean()) / x.std()) > stds)]

    Code  Year  Month  Day     Q      P
0    100  2017      1    4   2.0  42.90
1    100  2017      1    9   2.0  42.90
2    100  2017      1   18   1.0  45.05
3    100  2017      1   19   2.0  45.05
4    100  2017      1   20   1.0  45.05
5    100  2017      1   24  10.0  46.40
6    100  2017      1   26   1.0  46.40
11   100  2017      2    9   1.0  45.05
13   100  2017      3    8   1.0  49.90
14   100  2017      3   17   6.0  45.05
15   100  2017      3   24   1.0  45.05

filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
                     P
Code Year Month           
100  2017 1      44.821429
          2      45.050000
          3      46.666667
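
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and wraps the filter in a helper. The name `drop_group_outliers` and its parameters are my own and not part of pandas, so treat it as one possible way to package the idea rather than the definitive one.

```python
import pandas as pd

# Sample data copied from the question (a single product code, 100).
df = pd.DataFrame({
    "Code": [100] * 17,
    "Year": [2017] * 17,
    "Month": [1] * 8 + [2] * 5 + [3] * 4,
    "Day": [4, 9, 18, 19, 20, 24, 26, 28, 1, 7, 8, 9, 11, 8, 17, 24, 30],
    "Q": [2.0, 2.0, 1.0, 2.0, 1.0, 10.0, 1.0, 2.0,
          0.0, 2.0, 5.0, 1.0, 1.0, 1.0, 6.0, 1.0, 2.0],
    "P": [42.90, 42.90, 45.05, 45.05, 45.05, 46.40, 46.40, 92.80,
          0.00, 1.50, 0.80, 45.05, 1.50, 49.90, 45.05, 45.05, 1.50],
})


def drop_group_outliers(frame, key="Code", value="P", stds=1.0):
    """Hypothetical helper: drop rows whose per-group z-score exceeds `stds`."""
    grouped = frame.groupby(key)[value]
    # Broadcast each group's mean and sample std back onto the original rows.
    z = (frame[value] - grouped.transform("mean")) / grouped.transform("std")
    # Negating "is an outlier" (instead of testing <=) keeps rows whose z-score
    # is NaN, e.g. groups with a single row, matching the one-liner above.
    return frame[~(z.abs() > stds)]


filtered_df = drop_group_outliers(df, stds=1.0)
monthly_mean = filtered_df.groupby(["Code", "Year", "Month"])["P"].mean()
print(monthly_mean)
```

Using `transform("mean")` and `transform("std")` instead of a lambda avoids calling Python code once per group, which may matter when there are many product codes. Note that the mean and standard deviation are themselves influenced by the outliers, so the threshold `stds` usually needs tuning per dataset.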

You have the right idea. Just take the Boolean opposite of your outliers['P'] series with ~ and filter your DataFrame with loc:

res = df.loc[~outliers['P']]\
        .groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()

print(res)

   Code  Year  Month          P
0   100  2017      1  44.821429
1   100  2017      2  45.050000
2   100  2017      3  46.666667
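
To make the difference from the original attempt concrete, here is a small runnable sketch using the question's data (the Day and Q columns are left out for brevity). As far as I can tell, the attempt appeared not to work for two reasons: `df[outliers.any(axis=1)]` selects the flagged rows rather than dropping them, and its result is never assigned, so the later groupby still runs on the full `df`. The variable names `flagged` and `kept` are mine.

```python
import pandas as pd

# Price data copied from the question; Day and Q are omitted for brevity.
df = pd.DataFrame({
    "Code": [100] * 17,
    "Year": [2017] * 17,
    "Month": [1] * 8 + [2] * 5 + [3] * 4,
    "P": [42.90, 42.90, 45.05, 45.05, 45.05, 46.40, 46.40, 92.80,
          0.00, 1.50, 0.80, 45.05, 1.50, 49.90, 45.05, 45.05, 1.50],
})

stds = 1.0  # number of standard deviations that defines an outlier

# Same z-score computation as in the question: the group key is excluded
# from the transform, so the result has a single column, 'P'.
z = df[["Code", "P"]].groupby("Code").transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds

# What the original expression returns: the outlier rows themselves.
flagged = df[outliers.any(axis=1)]

# Negating the mask keeps everything else; assign the result and use it.
kept = df.loc[~outliers["P"]]
res = kept.groupby(["Code", "Year", "Month"], as_index=False)["P"].mean()

print(flagged["P"].tolist())
print(res)
```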
