How do I find the mean of a list stored in a column in PySpark?


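My DataFrame looks like the one below. I want to compute the mean of each list and put it in a new column. I can compute the average with a UDF, but I cannot get the result into a column. A solution that avoids a UDF would be ideal; otherwise, any help with my current approach is welcome.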

from pyspark.sql.types import StructType, StructField, StringType

# Note: the lists are stored as plain strings, not as arrays.
data = [
    ("Smith", "[55, 65, 75]"),
    ("Anna", "[33, 44, 55]"),
    ("Williams", "[9.5, 4.5, 9.7]"),
]

schema = StructType([
    StructField('name', StringType(), True),
    StructField('some_value', StringType(), True),
])

# `spark` is the active SparkSession (e.g. the one provided by the pyspark shell).
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

+--------+---------------+
|name    |some_value     |
+--------+---------------+
|Smith   |[55, 65, 75]   |
|Anna    |[33, 44, 55]   |
|Williams|[9.5, 4.5, 9.7]|
+--------+---------------+
The solution I attempted looks like this:

array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())
This comes from another answer, but it gives me back a DataFrame rather than a new column.


Any help is welcome. Thanks.

You have a string column that looks like an array, not an actual array column, so you also need to convert the data type inside the UDF:

import json
import numpy as np
import pyspark.sql.functions as F

array_mean = F.udf(lambda x: float(np.mean(json.loads(x))), 'float')
df2 = df.withColumn('mean_value', array_mean('some_value'))

df2.show()
+--------+---------------+----------+
|    name|     some_value|mean_value|
+--------+---------------+----------+
|   Smith|   [55, 65, 75]|      65.0|
|    Anna|   [33, 44, 55]|      44.0|
|Williams|[9.5, 4.5, 9.7]|       7.9|
+--------+---------------+----------+


Coming from pandas and being new to PySpark, I went the long way:

Strip the [] brackets

Split into a list

Explode

Take the mean

