Python: How can I handle outliers by transforming them with pandas instead of removing them?


I have a dataframe like the one shown below

dfx = pd.DataFrame({'min_temp' :[-138,36,34,38,237,339]})
As you can see, this data contains three outliers: -138, 237 and 339

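What I would like to do is identify the records that are

a) greater than 3 standard deviations, and replace them with the valid maximum value (considering the data range)

b) less than -3 standard deviations, and replace them with the valid minimum value (considering the data range)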

This is what I tried, but it is incorrect and inefficient

dfx.apply(lambda x: x[(x < dfx[min_temp].mean()-3*dfx[min_temp].std(), dfx[min_temp].mean()+3*dfx[min_temp].std())])
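In the example above, 38 is the maximum value because it lies within the 3sd limit and is a valid maximum (meaning it is not an outlier). Similarly, 34 is the minimum value because it lies within the -3sd limit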

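We need to use these values to replace all the outliers across the full dataframe

Please note that my real data has more than 60 columns and 1 million rows, and I would like to do this for all columns. Any efficient and scalable approach would be helpful

In my expected output, the outliers are replaced by the maximum valid value within 3sd (38 in this example)

Can you help me with this?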

Update after the suggested solution


Here is a generic function that follows the logic below to detect outliers

def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Z-score).'''
    mean_val = series.mean()
    std_val = series.std()

    z_score = (series - mean_val) / std_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min()

    # For comparison purposes.
    if verbose:
        lbound = mean_val - zscore_threshold * std_val
        ubound = mean_val + zscore_threshold * std_val
        print('\n'.join(
            ['Capping outliers by the Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)
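This function takes a dataframe as its argument, so make sure it contains only numeric columns


For each data point X:
abs(X-mean)

This answer is based on information from a good article on outlier detection, where you can read about each method.
The output of each code block shows the lower and upper bounds used for outlier detection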

First, let's define some sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(loc=20, scale=2, size=10)})

# Insert outliers (use .loc to avoid chained assignment)
df.loc[0, 'col1'] = 40
df.loc[1, 'col1'] = 0

df['col1']
Output:

0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
Capping outliers by the Z-score method:
   Z-score threshold: 3
   Lower bound: -8.28385086324063
   Upper bound: 49.22620154113844

0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
Capping outliers by the Modified Z-score method:
   Z-score threshold: 3
   Lower bound: 5.538418022763285
   Upper bound: 36.19368140628174

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
Capping outliers by the IQR method:
   IQR threshold: 1.5
   Lower bound: 15.464630871041477
   Upper bound: 26.331958943979345

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
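Z-score method
This method is the least robust of the three. It is not suitable for small datasets, because the mean and standard deviation are themselves heavily affected by the outliers.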
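
To make the small-sample problem concrete: with the sample standard deviation (the pandas default, ddof=1), no value in a set of n points can have |z| larger than (n-1)/sqrt(n), so a threshold of 3 can never trigger when n is 10 or less. The following is a minimal sketch using the `dfx` data from the question; the variable names `z`, `max_possible` and `n_flagged` are only illustrative.

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'min_temp': [-138, 36, 34, 38, 237, 339]})
col = dfx['min_temp']

# Z-score using the sample standard deviation (pandas default, ddof=1).
z = (col - col.mean()) / col.std()

n = len(col)
# Theoretical upper limit of |z| for n points when ddof=1.
max_possible = (n - 1) / np.sqrt(n)

# Number of values the 3-sd rule would flag in this column.
n_flagged = int((z.abs() > 3).sum())

print(z.abs().max())   # largest |z| actually observed
print(max_possible)    # upper limit for n = 6
print(n_flagged)
```

This may explain why the Z-score output shown for the 10-row sample data leaves 40 and 0 unchanged.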

Modified Z-score method
This method is more robust than the previous one. It uses the median and the MAD instead of the mean and std.

def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Modified Z-score).'''
    median_val = series.median()
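    # Mean absolute deviation around the mean; this is what Series.mad()
    # computed before it was removed in pandas 2.0.
    mad_val = (series - series.mean()).abs().mean()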

    z_score = (series - median_val) / mad_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min() 

    # For comparison purposes.
    if verbose:
        lbound = median_val - zscore_threshold * mad_val
        ubound = median_val + zscore_threshold * mad_val
        print('\n'.join(
            ['Capping outliers by the Modified Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)
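
The name "modified Z-score" is also commonly used for a variant built on the median absolute deviation (the median of |x - median|), scaled by the constant 0.6745 and usually paired with a cutoff of 3.5. That is not what the function above computes, so here is a rough sketch of that variant for comparison. The function name `cap_outliers_median_mad`, the default threshold and the sample values are assumptions made for illustration.

```python
import pandas as pd

def cap_outliers_median_mad(series, threshold=3.5):
    '''Caps outliers using a median/MAD-based modified Z-score (illustrative).'''
    median_val = series.median()
    # Median absolute deviation: median of |x - median|.
    mad_val = (series - median_val).abs().median()
    if mad_val == 0:
        # At least half the values equal the median; the score is undefined.
        return series.copy()

    # 0.6745 rescales the MAD so the score is comparable to a Z-score
    # for normally distributed data.
    score = 0.6745 * (series - median_val) / mad_val
    outliers = score.abs() > threshold

    capped = series.copy()
    # Replace high/low outliers with the closest existing in-range value.
    capped[score > threshold] = series[~outliers].max()
    capped[score < -threshold] = series[~outliers].min()
    return capped

s = pd.Series([40.0, 0.0, 19.2, 16.6, 21.4, 22.6, 21.0, 22.5, 20.5, 20.7])
result = cap_outliers_median_mad(s)
print(result)
```

Because both the centre and the spread are medians, the two inserted outliers have little influence on the bounds, which is the main appeal of this variant on small samples.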
IQR method
This is the strictest of the three methods.

def cap_outliers(series, iqr_threshold=1.5, verbose=False):
    '''Caps outliers to closest existing value within threshold (IQR).'''
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lbound = Q1 - iqr_threshold * IQR
    ubound = Q3 + iqr_threshold * IQR

    outliers = (series < lbound) | (series > ubound)

    series = series.copy()
    series.loc[series < lbound] = series.loc[~outliers].min()
    series.loc[series > ubound] = series.loc[~outliers].max()

    # For comparison purposes.
    if verbose:
        print('\n'.join(
            ['Capping outliers by the IQR method:',
             f'   IQR threshold: {iqr_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)
Conclusion
You should probably use the IQR method.
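
Since the question mentions 60+ columns of mixed types and about a million rows, one possible way to scale the IQR rule is to select the numeric columns and compute all bounds in one vectorized pass instead of calling the function column by column. This is only a sketch: `cap_outliers_iqr_frame` and the sample frame are hypothetical names, and integer columns may come back as floats after clipping.

```python
import pandas as pd

def cap_outliers_iqr_frame(df, iqr_threshold=1.5):
    '''Caps IQR outliers in every numeric column; other columns are untouched.'''
    out = df.copy()
    num = out.select_dtypes(include='number')

    # Per-column quartiles and bounds (each is a Series indexed by column name).
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    lbound = q1 - iqr_threshold * iqr
    ubound = q3 + iqr_threshold * iqr

    # Blank out the outliers, then take the per-column min/max of what is left.
    valid = num.where(num.ge(lbound, axis=1) & num.le(ubound, axis=1))
    # Clip each column to its closest existing in-range values.
    out[num.columns] = num.clip(lower=valid.min(), upper=valid.max(), axis=1)
    return out

sample = pd.DataFrame({
    'temp': [40.0, 0.0, 19.2, 16.6, 21.4, 22.6, 21.0, 22.5, 20.5, 20.7],
    'city': list('abcdefghij'),
})
capped = cap_outliers_iqr_frame(sample)
print(capped)
```

Clipping to the smallest and largest in-range values gives the same result as the per-series function above, while string and categorical columns pass through unchanged.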

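Thanks for your reply. Upvoted. Will update soon. Actually, I don't have a dataframe with only numeric columns. It has all types of columns: string, categorical, integer, etc.
Thanks for your reply. Upvoted. Will try it and update you.
Hi, you missed the min condition. I mean, any value less than -3sd should be replaced with the min value, and any value greater than +3sd should be replaced with the max value. I have updated the post as well.
@SSMK That makes sense. Please see the edited answer.
Hi, I tried this. It looks like it doesn't replace the min value. You can try it on the sample data as well. I think we also need a condition for -3sd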