
PySpark: fold and sum over an ArrayType column


I am trying to compute an element-wise sum, and I have created this dummy df. The output should be [10,4,4,1].

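from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Reuse the active session (for example in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

data = [
    ("James", [1, 1, 1, 1]),
    ("James", [2, 1, 1, 0]),
    ("James", [3, 1, 1, 0]),
    ("James", [4, 1, 1, 0]),
]

# One string column and one array-of-integers column.
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True),
])

df = spark.createDataFrame(data=data, schema=schema)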
posexplode works, but my actual df is too large, so I am trying to use fold instead, and it gives me an error. Any ideas? Thanks.

vec_df = df.select("scores")
vec_sums = vec_df.rdd.fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
File "", line 2, in

TypeError: unsupported operand type(s) for +: 'int' and 'list'


Before folding, you need to map the RDD of Rows to an RDD of lists:

vec_sums = vec_df.rdd.map(lambda x: x[0]).fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
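To see why [0]*4 is a reasonable zero value here, the following is a minimal plain-Python sketch of how a fold over partitions can be pictured. It does not use Spark: simulated_fold, add_vectors, and the two-partition split are hypothetical names and data chosen for illustration, and the real partitioning depends on your cluster. The point it illustrates is that the zero value is applied once per partition and once more when the partial results are merged, so it has to be neutral for the operation.

```python
from functools import reduce


def add_vectors(a, b):
    # Element-wise sum of two equal-length lists.
    return [x + y for x, y in zip(a, b)]


def simulated_fold(partitions, zero, op):
    # Hypothetical stand-in for RDD.fold: fold each partition starting from
    # the zero value, then fold the per-partition results the same way.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Two hypothetical partitions holding the already-unwrapped score lists.
partitions = [
    [[1, 1, 1, 1], [2, 1, 1, 0]],
    [[3, 1, 1, 0], [4, 1, 1, 0]],
]

vec_sums = simulated_fold(partitions, [0] * 4, add_vectors)
print(vec_sums)  # [10, 4, 4, 1]
```

Because a list of zeros leaves every element-wise sum unchanged, the result does not depend on how the rows happen to be split across partitions.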
To help understand this, you can look at what the RDDs contain:

>>> vec_df.rdd.collect()
[Row(scores=[1, 1, 1, 1]), Row(scores=[2, 1, 1, 0]), Row(scores=[3, 1, 1, 0]), Row(scores=[4, 1, 1, 0])]

>>> vec_df.rdd.map(lambda x: x[0]).collect()
[[1, 1, 1, 1], [2, 1, 1, 0], [3, 1, 1, 0], [4, 1, 1, 0]]
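So you can think of vec_df.rdd as holding nested lists: each Row wraps the scores list as its single field. Without the map, zip(a, b) pairs the first 0 of the zero value with the whole list inside the Row, which is why + is applied to an int and a list. The Row has to be unwrapped before the fold.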
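The same behaviour can be reproduced without Spark. The sketch below is an illustration under assumptions, not the Spark code path: a namedtuple stands in for pyspark.sql.Row (which is also a tuple subclass), and functools.reduce stands in for a fold over a single partition. The names rows, add_vectors, error_message, and unwrapped are made up for this example.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical stand-in for pyspark.sql.Row, which is also a tuple subclass.
Row = namedtuple("Row", ["scores"])

rows = [
    Row([1, 1, 1, 1]),
    Row([2, 1, 1, 0]),
    Row([3, 1, 1, 0]),
    Row([4, 1, 1, 0]),
]


def add_vectors(a, b):
    # Element-wise sum of two equal-length sequences.
    return [x + y for x, y in zip(a, b)]


# Folding the rows directly: zip pairs 0 with the whole list inside the Row,
# so the addition fails with a TypeError.
try:
    reduce(add_vectors, rows, [0] * 4)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Unwrapping each row first, as map(lambda x: x[0]) does, gives the sum.
unwrapped = [row[0] for row in rows]
vec_sums = reduce(add_vectors, unwrapped, [0] * 4)

print(error_message)
print(vec_sums)  # [10, 4, 4, 1]
```

Indexing with row[0] mirrors the x[0] in the answer; with a real Row, accessing the field by name would be an equivalent way to unwrap it.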