
Apache Spark: collect failed in ... Stage cancelled because SparkContext was shut down


I want to display the number of elements in each partition, so I wrote the following code:

def count_in_a_partition(iterator):
    yield sum(1 for _ in iterator)

If I use it like this:

print("Number of elements in each partition: {}".format(my_rdd.mapPartitions(count_in_a_partition).collect()))

I get the following:

19/02/18 21:41:15 INFO DAGScheduler: Job 3 failed: collect at /project/6008168/tamouze/testSparkCedar.py:435, took 30.859710 s
19/02/18 21:41:15 INFO DAGScheduler: ResultStage 3 (collect at /project/6008168/tamouze/testSparkCedar.py:435) failed in 30.848 s due to Stage cancelled because SparkContext was shut down
19/02/18 21:41:15 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/18 21:41:16 INFO MemoryStore: MemoryStore cleared
19/02/18 21:41:16 INFO BlockManager: BlockManager stopped
19/02/18 21:41:16 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/18 21:41:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_14 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_14 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_3 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_3 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 INFO SparkContext: Successfully stopped SparkContext
....
Note that my_rdd.take(1) returns:

How can I solve this?
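As a sanity check, a generator of this shape can be exercised without a Spark cluster by feeding it plain iterators in place of the partition iterators Spark would supply (a minimal sketch; the data below is made up):

```python
def count_in_a_partition(iterator):
    # Yield a single value: the number of elements in this partition's iterator.
    yield sum(1 for _ in iterator)

# Stand-ins for the iterators Spark would pass to mapPartitions.
fake_partitions = [iter(['a', 'b', 'c']), iter([1, 2])]
counts = [n for part in fake_partitions for n in count_in_a_partition(part)]
print(counts)  # [3, 2]
```

If this works locally, the helper itself is fine and the failure lies in the cluster run, not in the counting logic.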

You have to use the glom function to solve this. Let's take an example.

Let's first create a DataFrame:

rdd=sc.parallelize([('a',22),('b',1),('c',4),('b',1),('d',2),('e',0),('d',3),('a',1),('c',4),('b',7),('a',2),('f',1)] )
df=rdd.toDF(['key','value'])
df=df.repartition(5,"key") # Make 5 Partitions
Number of partitions -

print("Number of partitions: {}".format(df.rdd.getNumPartitions())) 
    Number of partitions: 5
Number of rows/elements in each partition. This can give you an idea of skew -

print('Partitioning distribution: '+ str(df.rdd.glom().map(len).collect()))
    Partitioning distribution: [3, 3, 2, 2, 2]
See how the rows are actually distributed across the partitions. Note that if the dataset is large, your system could crash because it runs out of memory:


Comments: The solution the OP uses is correct and optimal in terms of performance, so "you have to use glom" is plain wrong; the proposed solution is strictly worse than the original one because, as you have already noted, it has to load whole partitions into memory. And it does not address the actual source of the failure, which, to be fair, cannot be determined from the information provided in the post. It is not answerable in its current state. Could you add an MVCE? That would help.
print("Partitions structure: {}".format(df.rdd.glom().collect()))
    Partitions structure: [
       #Partition 1        [Row(key='a', value=22), Row(key='a', value=1), Row(key='a', value=2)], 
       #Partition 2        [Row(key='b', value=1), Row(key='b', value=1), Row(key='b', value=7)], 
       #Partition 3        [Row(key='c', value=4), Row(key='c', value=4)], 
       #Partition 4        [Row(key='e', value=0), Row(key='f', value=1)], 
       #Partition 5        [Row(key='d', value=2), Row(key='d', value=3)]
                          ]
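The comments above argue that the OP's mapPartitions-based count is preferable to glom() precisely because glom materializes every partition as a list. The difference can be sketched in pure Python, with lists of iterators standing in for Spark partitions (hypothetical data, no cluster required):

```python
# Stand-ins for the per-partition iterators Spark would provide.
partitions = [iter(range(3)), iter(range(5)), iter(range(2))]

# glom-style: materialize each partition as a list, then take its length.
# This is what makes glom() risky on large partitions (O(partition) memory).
glom_counts = [len(list(p)) for p in [iter(range(3)), iter(range(5)), iter(range(2))]]

# mapPartitions-style: stream through each iterator, O(1) memory per partition.
stream_counts = [sum(1 for _ in p) for p in partitions]

print(glom_counts, stream_counts)  # [3, 5, 2] [3, 5, 2]
```

Both approaches report the same counts; they differ only in how much of each partition is held in memory at once.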