Outlier treatment using the mean in PySpark
My dataframe looks like -
id gender age
1 m 27
2 m 39
3 f 99
4 f 11
5 m 46
6 f 60
I want my final dataframe to look like -
id gender age new_age
1 m 27 27
2 m 39 39
3 f 99 43
4 f 11 43
5 m 46 46
6 f 60 60
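The replacement value 43 is the mean of the ages that fall inside the accepted range (assumed to be 18-60, based on the condition in the code below). A quick plain-Python check of that arithmetic:

```python
ages = [27, 39, 99, 11, 46, 60]

# keep only the ages inside the assumed accepted range 18-60
valid = [a for a in ages if 18 <= a <= 60]

# mean of the remaining ages: (27 + 39 + 46 + 60) / 4
mean_val = sum(valid) / len(valid)
print(mean_val)  # 43.0
```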
My code -
from pyspark.sql.functions import mean as _mean, stddev as _stddev, col, when
condition = ((df['age'] >= 18) & (df['age'] <= 60))
df = df.withColumn("new_age", when(condition, col("age")).otherwise(_mean(col('age'))))
Here is one approach:
from pyspark.sql import functions as F

# build a regex alternation ("11|99") of the outlier values
outliers = [11,99]
outliers_str = '|'.join([str(i) for i in outliers])
# calculate mean without outlier values
mean_val = df.select("age").rdd.flatMap(lambda x: [i for i in x if i not in outliers]).mean()
# replace the outlier values with the mean
df = df.withColumn('new_age', F.regexp_replace('age', outliers_str, f'{mean_val}').cast('int'))
+---+------+---+-------+
| id|gender|age|new_age|
+---+------+---+-------+
| 1| m| 27| 27|
| 2| m| 39| 39|
| 3| f| 99| 43|
| 4| f| 11| 43|
| 5| m| 46| 46|
| 6| f| 60| 60|
+---+------+---+-------+
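For clarity, the answer's logic - compute the mean over the non-outlier ages, then substitute it for each outlier - can be sketched in plain Python, with no Spark required (the row tuples below just mirror the example data):

```python
# (id, gender, age) rows mirroring the example dataframe
rows = [(1, "m", 27), (2, "m", 39), (3, "f", 99),
        (4, "f", 11), (5, "m", 46), (6, "f", 60)]
outliers = {11, 99}

# mean of the non-outlier ages only
kept = [age for _, _, age in rows if age not in outliers]
mean_val = sum(kept) / len(kept)

# substitute the mean for each outlier, keeping other ages unchanged
new_age = [age if age not in outliers else int(mean_val) for _, _, age in rows]
print(new_age)  # [27, 39, 43, 43, 46, 60]
```

Note that the regex-based `regexp_replace` in the answer matches substrings, so a pattern like `11` would also fire inside an age of `110`; an equality-based `when(...).otherwise(...)` test avoids that fragility.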