Apache Spark: 'treeStruct' object is not iterable when trying to collect an RDD (PySpark)


I'm very new to Spark. This error occurred when I tried to collect the results from RDD_new after passing a top-level external function to RDD_old.reduceByKey.

First, I defined a tree structure:

class treeStruct(object):
    def __init__(self, node, edge):
        self.node = node    # a dictionary of nodes
        self.edge = edge    # a dictionary of edges
After that, I used sc.parallelize to convert the two treeStructs into an RDD:

RDD = sc.parallelize([treeStruct1,treeStruct2])
Then I passed a top-level function, defined outside the driver code, to reduceByKey. The function contains several loops over the trees' attributes, like:

def func(tree1, tree2):
    if <condition on certain attributes of the trees>:
        for dummy in <something>:
            # do something to the tree attributes
    if <condition on certain attributes of the trees>:
        for dummy2 in <something>:
            # do something to the tree attributes
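For concreteness, here is a hypothetical sketch of what such a top-level reducer could look like; the class, the attribute names, and the merge logic are illustrative assumptions, not the original code:

```python
class treeStruct(object):
    def __init__(self, node, edge):
        self.node = node    # dict of nodes
        self.edge = edge    # dict of edges

# Defined at module (top) level so PySpark can pickle it and ship it
# to the workers.
def func(tree1, tree2):
    # Hypothetical merge: copy into tree1 any entries it is missing.
    if tree1.node and tree2.node:
        for key, value in tree2.node.items():
            tree1.node.setdefault(key, value)
    if tree1.edge and tree2.edge:
        for key, value in tree2.edge.items():
            tree1.edge.setdefault(key, value)
    return tree1
```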
When I tried to collect the results, the following error occurred:

Driver stacktrace:
17/03/07 13:38:37 INFO DAGScheduler: Job 0 failed: collect at /mnt/hgfs/VMshare/ditto-dev/pkltreeSpark_RDD.py:196, took 3.088593 s
Traceback (most recent call last):
  File "/mnt/hgfs/VMshare/pkltreeSpark_RDD.py", line 205, in <module>
startTesting(1,1)
  File "/mnt/hgfs/VMshare/pkltreeSpark_RDD.py", line 196, in startTesting
tmp = matchingOutcome.collect()
  File "/usr/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 809, in collect
  File "/usr/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
  File "/usr/spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/usr/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/usr/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1828, in combineLocally
  File "/usr/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
TypeError: 'treeStruct' object is not iterable
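The failing line in the trace, `for k, v in iterator` inside mergeValues, is Spark unpacking each RDD element as a (key, value) pair. The same TypeError can be reproduced without Spark at all (a minimal sketch; the class here just mirrors the one above):

```python
class treeStruct(object):
    def __init__(self, node, edge):
        self.node = node
        self.edge = edge

# reduceByKey treats each element as a (key, value) pair and unpacks it,
# which requires the element itself to be iterable.
try:
    for k, v in [treeStruct({}, {})]:
        pass
except TypeError as exc:
    # Python 3 reports something like:
    # "cannot unpack non-iterable treeStruct object"
    print(exc)
```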
I'm confused. Does this mean I shouldn't use iteration inside the function? Or that I shouldn't structure my object the way I do now?

Also, this error seems to be about how to iterate over certain attributes of the RDD elements, not about key-value pairs.


Any help would be great.

I finally figured it out: the problem was caused by my class definition. I wanted to iterate over this tree structure, but it had no iterator and so was not iterable. The problem can therefore be solved by adding an iterator to the class:

class treeStruct(object):
    def __init__(self, node, edge):
        self.node = node
        self.edge = edge

    # add an iterator
    def __iter__(self):
        for x in [self.node,self.edge]:
            yield x
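With `__iter__` in place, an instance unpacks cleanly into a (node, edge) pair, which is exactly what the `for k, v in iterator` line in the traceback needs. A quick Spark-free check (the dictionary contents are made up for illustration):

```python
class treeStruct(object):
    def __init__(self, node, edge):
        self.node = node
        self.edge = edge

    # Yielding node then edge makes the object unpackable as a 2-tuple.
    def __iter__(self):
        for x in [self.node, self.edge]:
            yield x

tree = treeStruct({"a": 1}, {"a-b": 2})
k, v = tree   # unpacking now works instead of raising TypeError
```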
Anyway, thanks for your help, everyone.