Python: Remove outliers based on two columns in a DataFrame


I have a dataframe that looks like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7
2017 1     TennisBall  300
2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5
2018 2     TennisBall  6
2018 2     TennisBall  275
...
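In the example above, we usually ship around 300 tennis balls only in February, so an order of 6 units that month is an outlier. In January the normal quantity is ~5, so any much larger order that month is an outlier. I want to remove outliers based on the Weight for each month. Is there a simple way to do this? I know I can do the following, but it compares every row against the mean and standard deviation of the whole column rather than of its own month: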

df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]

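Desired result: the outlier 300 in January is removed (far above normal for January), and the outlier 6 in February is removed (normal for January, but an outlier for February).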
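To see why the whole-column filter falls short, here is a minimal sketch that rebuilds only the ten rows shown above (the real data is longer, so the numbers are illustrative). The names `global_5std` and `global_1std` are made up for this example.

```python
import numpy as np
import pandas as pd

# Only the ten sample rows from the question; the full dataset is elided there.
df1 = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1] * 5 + [2] * 5,
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "Football", "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

# Distance of every row from the mean of the WHOLE column.
deviation = np.abs(df1.Weight - df1.Weight.mean())

# The filter from the question: 5 standard deviations of the whole column.
global_5std = df1[deviation <= 5 * df1.Weight.std()]

# A tighter whole-column threshold does not help: it drops the large
# orders in every month and still keeps the 6-unit order in February.
global_1std = df1[deviation <= df1.Weight.std()]

print(len(global_5std))           # number of rows that survive the 5-std filter
print(global_1std.Weight.tolist())
```

On this sample the 5-std filter keeps every row, and the 1-std variant removes the wrong rows, which is why the comparison has to be made within each group.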

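This is a groupby problem. You can solve it by creating two new columns that hold each group's mean deviation and standard deviation, and then filtering on those columns: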

import numpy as np

# Difference between each Weight and the mean of its (Year, Month, Equipment) group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Standard deviation of each group, broadcast back to the rows
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows that lie within one standard deviation of their group mean.
# The "or" condition keeps single-value groups, whose std is NaN.
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Remove the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)
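The same idea can be written without temporary columns. The following is a sketch, not the answerer's code: the function name `drop_group_outliers` and its `n_std` parameter are hypothetical, and the sample frame contains only the ten rows shown in the question.

```python
import pandas as pd

def drop_group_outliers(df, group_cols, value_col, n_std=1.0):
    """Keep rows within n_std group standard deviations of their group mean."""
    grouped = df.groupby(group_cols)[value_col]
    # transform() returns a Series aligned with df, one value per row.
    deviation = (df[value_col] - grouped.transform("mean")).abs()
    std = grouped.transform("std")
    # Single-row groups have std == NaN; keep them rather than drop them.
    keep = (deviation <= n_std * std) | std.isna()
    return df[keep]

# Only the ten sample rows from the question.
df = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1] * 5 + [2] * 5,
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "Football", "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

result = drop_group_outliers(df, ["Year", "Month", "Equipment"], "Weight")
print(result)
```

With the default `n_std=1.0` this matches the one-standard-deviation threshold used in the answer. The question's own attempt used 5, so the threshold is worth tuning: in very small groups a single extreme value inflates the group's standard deviation, and a wide threshold may then remove nothing.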

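Could you include a sample dataframe and the desired output?
The sample dataframe is the first one. I added the desired output at the end. Thanks.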
>>> print(df)

0   Year Month   Equipment  Weight
1   2017     1  TennisBall       5
2   2017     1    Football       4
3   2017     1  TennisBall       6
4   2017     1  TennisBall       7
6   2017     2  TennisBall     300
7   2018     2  TennisBall     250
8   2018     2    Football       5
10  2018     2  TennisBall     275