Covariance and correlation failing with Spark and Python

I am new to Spark and I am trying to compute cov and cor (covariance and correlation), but when I call sum, Apache Spark throws an error.

My code:

from pyspark.mllib.stat import Statistics
rddX = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
rddY = sc.parallelize([7,6,5,4,5,6,7,8,9,10])
XY= rddX.zip(rddY)


Statistics.corr(XY)
Meanx = rddX.sum()/rddX.count()
Meany = rddY.sum()/rddY.count()
print(Meanx,Meany)
cov = XY.map(lambda x,y: (x-Meanx)*(y-Meany)).sum()
My error:

opt/ibm/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
    913         # zeroValue provided to each partition is unique from the one provided
    914         # to the final reduce call
--> 915         vals = self.mapPartitions(func).collect()
    916         return reduce(op, vals, zeroValue)

Why does it show this error if sum returns an RDD array?

Try splitting the RDD operations into separate steps:

sumX = rddX.sum()       # total of the values in rddX
countX = rddX.count()   # number of values in rddX
meanX = sumX / countX   # mean of rddX
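
The final map is another likely source of the traceback: rddX.zip(rddY) produces an RDD of (x, y) tuples, and RDD.map passes each tuple to the lambda as a single argument, so lambda x,y: ... raises a missing-argument error as soon as sum() (via fold and collect) forces evaluation. Below is a minimal sketch of the whole computation under that assumption, reusing the rddX, rddY and XY already defined in the question:

from pyspark.mllib.stat import Statistics

# Pearson correlation between two RDDs of scalars
print(Statistics.corr(rddX, rddY))

meanX = rddX.sum() / float(rddX.count())
meanY = rddY.sum() / float(rddY.count())

# each element of XY is an (x, y) tuple, so take a single argument and index into it
cov = XY.map(lambda xy: (xy[0] - meanX) * (xy[1] - meanY)).sum() / XY.count()
print(cov)

Dividing the summed products by the number of pairs turns the sum into the population covariance, and the float() casts keep the means exact if this runs under Python 2.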