Removing outliers after a groupby in Python

python pandas statistics


This is my first post, so please go easy on me.

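I am trying to draw a box plot of life expectancy for each country from 2000 to 2015. My CSV file contains each country 16 times, once per year. I drew the plot with df.boxplot(by=['Country'], column='Life Expectancy'), and I can see the life expectancy outliers in it. I was also able to get the quartiles for each country with Q1 = df.groupby('Country')['Life Expectancy'].quantile(0.25) and Q3 = df.groupby('Country')['Life Expectancy'].quantile(0.75). I have looked at many tutorials, but none of them use groupby, so I am stuck and not sure what to do next. Any help is welcome.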

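Define a function that returns a DataFrame with the lower and upper limits (assuming you only want the values inside the IQR), apply it per country with groupby, assign the resulting columns to df, and finally run a query that keeps only the rows whose value is not an outlier: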

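import pandas as pd

def fun(serie):
    # Repeat the group's Q1 and Q3 on every row so the result lines up with the original index.
    return pd.DataFrame([[serie.quantile(0.25), serie.quantile(0.75)]]
                             * serie.shape[0],
                        columns=['lower', 'upper'],
                        index=serie.index)

# group_keys=False keeps the original row index, so the assignment aligns row by row.
df[['lower', 'upper']] = df.groupby('Country', group_keys=False)['Life Expectancy'].apply(fun)

# Keeps only the values between Q1 and Q3 of their own country (the middle 50%).
# The parentheses are required for the method chain to span two lines.
df = (df.query('lower <= `Life Expectancy` <= upper')
        .drop(columns=['lower', 'upper']))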
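
The code above keeps only the values between Q1 and Q3, which discards roughly half of each group. A box plot usually marks as outliers only the points beyond 1.5 × IQR from the quartiles, so a variant using those fences may be closer to what the plot shows. The following is a minimal sketch of that variant using transform. The helper name drop_iqr_outliers, the parameter k, and the sample data are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def drop_iqr_outliers(df, group_col, value_col, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] of its group.

    Hypothetical helper; k=1.5 is the usual box-plot whisker convention.
    """
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto its rows,
    # so q1 and q3 share df's index.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = df[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Made-up sample data: country A has one obviously wrong value.
sample = pd.DataFrame({
    'Country': ['A'] * 6 + ['B'] * 6,
    'Life Expectancy': [70.0, 71.0, 72.0, 73.0, 74.0, 20.0,
                        60.0, 61.0, 62.0, 63.0, 64.0, 65.0],
})
cleaned = drop_iqr_outliers(sample, 'Country', 'Life Expectancy')
```

With this sample, only the 20.0 row of country A falls outside its country's fences. Using transform avoids adding and dropping temporary columns, because the per-group quartiles already come back aligned with the original rows.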


If you are trying to remove outliers, I would use the z-score rather than quantiles:

import numpy as np
from scipy import stats

# Flag rows whose absolute z-score, computed over the whole column, is at least 3.
df['outlier'] = np.abs(stats.zscore(df['Life Expectancy'])) >= 3  # replace 3 with a threshold of your choice
new_df = df[~df['outlier']].copy()

However, since you want to do this per group, you can use:

df.groupby('Country')['Life Expectancy'].transform(lambda x : stats.zscore(x,ddof=1))
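
The transform call returns the per-country z-scores but does not yet filter anything. A minimal sketch of the remaining step is below, assuming the same Country and Life Expectancy columns. To stay free of extra dependencies it computes the z-score with pandas as (x - x.mean()) / x.std(), which matches stats.zscore(x, ddof=1) because pandas uses ddof=1 for std by default. The sample data and the threshold are made up for illustration.

```python
import pandas as pd

# Made-up sample data: 16 values per country, with one bad entry in country A.
sample = pd.DataFrame({
    'Country': ['A'] * 16 + ['B'] * 16,
    'Life Expectancy': [70 + 0.1 * i for i in range(15)] + [20.0]
                       + [60 + 0.1 * i for i in range(16)],
})

threshold = 3  # assumed cut-off; pick one that suits the data

# z-score of each value relative to its own country's mean and sample std.
z = (sample.groupby('Country')['Life Expectancy']
           .transform(lambda x: (x - x.mean()) / x.std()))

# Keep only rows whose absolute z-score is below the threshold.
new_df = sample[z.abs() < threshold].copy()
```

One thing to keep in mind: with 16 rows per country, a sample z-score can never exceed (16 - 1) / sqrt(16) = 3.75 in absolute value, so a threshold of 3 only flags fairly extreme values, and a lower threshold may be worth considering for groups this small.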


Nice solution, I didn't know about SciPy's zscore, thanks!

I like yours too!

Hi, did either of these solutions work for you?