Python DataFrame: removing outliers

python pandas

Given a DataFrame, I want to exclude the rows that correspond to outliers (using a Z-score threshold of 3) based on one of its columns.

The DataFrame looks like this:

df.dtypes
_id                   object
_index                object
_score                object
_source.address       object
_source.district      object
_source.price        float64
_source.roomCount    float64
_source.size         float64
_type                 object
sort                  object
priceSquareMeter     float64
dtype: object

For the line:

dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]

an exception is raised.

Why does the above exception occur, and how can I exclude the outliers?

Whenever you run into this kind of problem, use a boolean mask like this:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example data
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]  # keep only the rows within 3 standard deviations of the mean of column 'Data'
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # or the other way around: drop the rows farther away than that
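
Applied to the question's data, the same mask can be built on the `_source.price` column. The sketch below is only an illustration: the frame is a made-up stand-in (the prices and the address value are invented), and it uses bracket access because a column name containing a dot cannot be reached as an attribute the way `df.Data` can.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame: 200 ordinary prices plus one extreme value.
rng = np.random.default_rng(0)
prices = np.append(rng.normal(loc=300_000, scale=20_000, size=200), 5_000_000)
df = pd.DataFrame({'_source.price': prices, '_source.address': 'n/a'})

price = df['_source.price']
# True for rows whose price lies within 3 standard deviations of the column mean.
within_3_std = np.abs(price - price.mean()) <= 3 * price.std()
dff = df[within_3_std]
```

Because the mask is a one-dimensional boolean Series, `df[within_3_std]` drops whole rows and keeps every other column untouched.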

I believe you can build a boolean filter that marks the outliers and then select everything that is not marked.

import numpy as np
from scipy import stats
outliers = np.abs(stats.zscore(df['_source.price'])) >= 3  # True for rows at least 3 standard deviations from the mean
df_without_outliers = df[~outliers]
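
As for why the line in the question fails, two things stand out: `stats.zscore(df)` is applied to the whole frame, which includes object (string) columns that cannot be standardized, and `axis` expects an axis (0 or 1), not a column name. A minimal sketch of both a multi-column and a single-column filter follows, assuming a small invented frame in place of the real data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 100 ordinary prices plus one extreme value.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    '_source.address': ['somewhere'] * 101,  # object column, cannot be z-scored
    '_source.price': np.append(rng.normal(300_000, 20_000, 100), 5_000_000),
    '_source.size': rng.normal(80, 10, 101),
})

# Restrict the z-score to numeric columns, then require every one of them to be in range.
numeric = df.select_dtypes(include='number')
z = np.abs(stats.zscore(numeric))
dff = df[(z < 3).all(axis=1)]

# Or filter on a single column only.
dff_price = df[np.abs(stats.zscore(df['_source.price'])) < 3]
```

Note that `stats.zscore` propagates NaN by default, so a column with missing values would need to be cleaned (or handled with its `nan_policy` argument) first.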

AttributeError: 'DataFrame' object has no attribute 'Data'

Exactly what I was looking for. Thumbs up, thanks.

If you would rather use a property of the dataset itself, such as the interquartile range (IQR), you can filter on it as shown below:

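import numpy as np
import pandas as pd

def Remove_Outlier_Indices(df):
    # Quartiles and interquartile range, computed per column
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # True where a value lies inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    trueList = ~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
    return trueList

# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data': np.random.normal(size=200)})

# Boolean Mask of Non-Outliers (same shape as df)
nonOutlierList = Remove_Outlier_Indices(df)

# Non-Outlier Subset of the Given Dataset
# (df[nonOutlierList] alone would only replace outliers with NaN, so keep the rows where every column passes)
dfSubset = df[nonOutlierList.all(axis=1)]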
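
One detail worth checking with the IQR approach: indexing a frame with a boolean DataFrame keeps the shape and writes NaN where the mask is False, whereas a one-boolean-per-row mask actually removes rows. The sketch below shows the difference on a tiny hand-made dataset; `iqr_mask` is a hypothetical helper name used only for this illustration.

```python
import numpy as np
import pandas as pd

def iqr_mask(frame, k=1.5):
    """Hypothetical helper: boolean frame, True where a value lies inside the IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return ~((frame < (q1 - k * iqr)) | (frame > (q3 + k * iqr)))

# Seven ordinary values and one obvious outlier.
df = pd.DataFrame({'Data': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 100.0]})
mask = iqr_mask(df)

masked = df[mask]               # same shape as df, the outlier becomes NaN
dropped = df[mask.all(axis=1)]  # the outlier row is removed
```

Here the quartiles are 11.75 and 15.25, so the fences are 6.5 and 20.5 and only the value 100.0 falls outside them.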