How do I find the mean of a list in a column in PySpark?
My DataFrame looks like the one below. I want to compute the mean of each list and put it in a new column. I can compute the average with a UDF, but I can't get the result into a column. A solution without a UDF would be ideal; otherwise, any help with my current approach is welcome.
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = [
    ("Smith", "[55, 65, 75]"),
    ("Anna", "[33, 44, 55]"),
    ("Williams", "[9.5, 4.5, 9.7]"),
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('some_value', StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+--------+---------------+
|name    |some_value     |
+--------+---------------+
|Smith   |[55, 65, 75]   |
|Anna    |[33, 44, 55]   |
|Williams|[9.5, 4.5, 9.7]|
+--------+---------------+
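My current attempt looks like this:
array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())
but this returns a DataFrame, not a new column.
Any help is welcome. Thanks.

You have a string column that looks like an array, not an actual array column, so you also need to convert the data type inside the UDF: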
import json
import numpy as np
import pyspark.sql.functions as F
array_mean = F.udf(lambda x: float(np.mean(json.loads(x))), 'float')
df2 = df.withColumn('mean_value', array_mean('some_value'))
df2.show()
+--------+---------------+----------+
|    name|     some_value|mean_value|
+--------+---------------+----------+
|   Smith|   [55, 65, 75]|      65.0|
|    Anna|   [33, 44, 55]|      44.0|
|Williams|[9.5, 4.5, 9.7]|       7.9|
+--------+---------------+----------+

Coming from pandas and new to PySpark, I took the long way round: strip the [], split into a list, explode, and take the mean.