How can I handle outliers in Python by transforming them with pandas instead of removing them?


python python-3.x pandas dataframe


I have a dataframe like the one shown below

dfx = pd.DataFrame({'min_temp' :[-138,36,34,38,237,339]})

As you can see, this data contains three outliers: -138, 237, and 339.

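What I would like to do is identify the records that are

a) greater than 3 standard deviations, and replace them with the valid maximum value (considering the data range)

b) less than -3 standard deviations, and replace them with the valid minimum value (considering the data range)

This is what I tried, but it is incorrect and not efficient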

dfx.apply(lambda x: x[(x < dfx[min_temp].mean()-3*dfx[min_temp].std(), dfx[min_temp].mean()+3*dfx[min_temp].std())])



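In the example above, 38 is the maximum value because it lies within the 3sd limit and is the valid maximum (meaning it is not an outlier). Similarly, 36 is the minimum value because it lies within -3sd.

We need to use these values to replace all outliers in the complete dataframe.
Please note that my real data has more than 60 columns and 1 million rows, and I would like to do this for all columns. Any efficient and scalable approach is helpful.
In my expected output, the outliers are replaced by the maximum valid value within 3sd (38 in this case).

Can you help me?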
Update after the suggested solutions
Here is a generic function that follows the logic below to detect non-outliers
def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Z-score).'''
    mean_val = series.mean()
    std_val = series.std()

    z_score = (series - mean_val) / std_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min()

    # For comparison purposes.
    if verbose:
        lbound = mean_val - zscore_threshold * std_val
        ubound = mean_val + zscore_threshold * std_val
        print('\n'.join(
            ['Capping outliers by the Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)

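This function takes a dataframe as its argument, so make sure it contains only numeric columns.
For each data point X, the test is based on abs(X - mean).
This answer is based on information from a good article on outlier detection. You can read about each method there.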

The output of each code snippet shows the lower and upper bounds used for outlier detection.
First, let's define some sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(loc=20, scale=2, size=10)})

# Insert outliers (use .loc to avoid chained assignment)
df.loc[0, 'col1'] = 40
df.loc[1, 'col1'] = 0

df['col1']

Output:
0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64

Capping outliers by the Z-score method:
   Z-score threshold: 3
   Lower bound: -8.28385086324063
   Upper bound: 49.22620154113844

0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64

Capping outliers by the Modified Z-score method:
   Z-score threshold: 3
   Lower bound: 5.538418022763285
   Upper bound: 36.19368140628174

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64

Capping outliers by the IQR method:
   IQR threshold: 1.5
   Lower bound: 15.464630871041477
   Upper bound: 26.331958943979345

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64

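Z-score method
This method is the least robust of the three. It does not work well on small datasets, because the mean and the standard deviation are themselves heavily affected by the outliers.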
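To make the small-sample problem concrete, here is a minimal sketch that computes the Z-scores for the sample data from the question. It assumes the sample standard deviation (pandas' default, ddof=1); the variable names `z` and `max_possible` are only illustrative. With six rows, no value can reach a Z-score of 3, because the largest attainable absolute Z-score for n points is (n - 1) / sqrt(n).

```python
import numpy as np
import pandas as pd

# Sample data from the question
dfx = pd.DataFrame({'min_temp': [-138, 36, 34, 38, 237, 339]})
s = dfx['min_temp']

# Z-score of every row; Series.std() uses the sample std (ddof=1) by default
z = (s - s.mean()) / s.std()
print(z.round(2))

# Largest |z| here is about 1.46, so a threshold of 3 flags nothing
print('max |z|:', z.abs().max())

# Upper limit on |z| for n points when the sample std is used
n = len(s)
max_possible = (n - 1) / np.sqrt(n)
print('largest attainable |z| for n =', n, 'is', max_possible)
```

Under this assumption, at least 11 rows are needed before any single value can exceed a Z-score of 3, which is why the Z-score run on the 10-row sample data leaves the series unchanged.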
Modified Z-score method
This method is more robust than the previous one. It uses the median and the MAD instead of the mean and the std.
def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Modified Z-score).'''
    median_val = series.median()
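    # Series.mad() was removed in pandas 2.0. It computed the mean absolute
    # deviation around the mean, which is reproduced here explicitly.
    mad_val = (series - series.mean()).abs().mean()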

    z_score = (series - median_val) / mad_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min() 

    # For comparison purposes.
    if verbose:
        lbound = median_val - zscore_threshold * mad_val
        ubound = median_val + zscore_threshold * mad_val
        print('\n'.join(
            ['Capping outliers by the Modified Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)
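Note that `Series.mad()` computed the mean absolute deviation around the mean and is no longer available in pandas 2.0 and later. If a true median absolute deviation is wanted, a variant could look like the sketch below. The function name `cap_outliers_median_mad` and the small `sample` series are hypothetical, and the 0.6745 scaling constant that some definitions of the modified Z-score include is left out here, so treat the threshold as an assumption to tune.

```python
import pandas as pd

def cap_outliers_median_mad(series, threshold=3):
    """Cap outliers to the closest inlier value, scoring with median and median-based MAD."""
    median_val = series.median()
    # Median absolute deviation: median of the absolute distances to the median
    mad_val = (series - median_val).abs().median()

    # If more than half of the values are identical, MAD is 0 and the score is undefined
    if mad_val == 0:
        return series.copy()

    score = (series - median_val) / mad_val
    inliers = series[score.abs() <= threshold]

    # Clipping to the inlier range replaces high outliers with the largest
    # inlier and low outliers with the smallest inlier
    return series.clip(lower=inliers.min(), upper=inliers.max())

sample = pd.Series([10, 11, 12, 11, 10, 12, 11, 100])
capped = cap_outliers_median_mad(sample)
print(capped.tolist())
```

In this sample the median is 11 and the MAD is 1, so 100 gets a score of 89 and is replaced by 12, the largest value inside the threshold.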

IQR method
This is the strictest of the three methods.
def cap_outliers(series, iqr_threshold=1.5, verbose=False):
    '''Caps outliers to closest existing value within threshold (IQR).'''
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lbound = Q1 - iqr_threshold * IQR
    ubound = Q3 + iqr_threshold * IQR

    outliers = (series < lbound) | (series > ubound)

    series = series.copy()
    series.loc[series < lbound] = series.loc[~outliers].min()
    series.loc[series > ubound] = series.loc[~outliers].max()

    # For comparison purposes.
    if verbose:
        print('\n'.join(
            ['Capping outliers by the IQR method:',
             f'   IQR threshold: {iqr_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)

Conclusion
You should probably use the IQR method.
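The question mentions more than 60 columns of mixed types, so here is a minimal sketch of applying a capping function to every numeric column while leaving string and categorical columns untouched. The names `cap_series` and `cap_numeric_columns` and the example frame `df_mixed` are hypothetical. The Z-score rule is used only to mirror the question; the inlier test could be swapped for the IQR bounds.

```python
import pandas as pd

def cap_series(series, zscore_threshold=3):
    """Cap values beyond the Z-score threshold to the nearest inlier value."""
    z = (series - series.mean()) / series.std()
    inliers = series[z.abs() <= zscore_threshold]
    # Constant or all-NaN columns produce no usable scores: leave them as they are
    if inliers.empty:
        return series
    return series.clip(lower=inliers.min(), upper=inliers.max())

def cap_numeric_columns(df, zscore_threshold=3):
    """Return a copy of df with outliers capped in every numeric column."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include='number').columns
    if len(numeric_cols) == 0:
        return out
    # Each column is processed with vectorized operations
    out[numeric_cols] = out[numeric_cols].apply(
        cap_series, zscore_threshold=zscore_threshold)
    return out

# Example with one numeric and one string column
df_mixed = pd.DataFrame({
    'a': [10.0] * 10 + [12.0] * 10 + [1000.0],
    'label': ['x'] * 21,
})
result = cap_numeric_columns(df_mixed)
print(result['a'].max())
```

Here 1000.0 has a Z-score above 3 and is replaced by 12.0, the largest inlier, while the `label` column passes through unchanged.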
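Thanks for your reply. Upvoted. Will update soon. Actually, I don't have a dataframe with only numeric columns. It has all types of columns: string, categorical, integer, etc.
Thanks for your reply. Upvoted. Will try it and update you.
Hi, you missed the min condition. I mean that any value less than -3sd should be replaced with the min value, and any value greater than +3sd should be replaced with the max value. I have updated the post as well.
@SSMK That makes sense. Please see the edited answer.
Hi, I tried this. It looks like it does not replace the minimum value. You can also try it on the sample data. I think we also need a condition for -3sd.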