Apache Spark: How to join two large datasets in Spark


apache-spark


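I am currently working with large datasets. One is about 670 GB of Snappy-compressed Parquet files with 200 unique keys and 10,000 partitions. The other dataset is also large, with more than 100 keys and 200 partitions, so it cannot be used as a broadcast table in the join.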

--conf spark.sql.shuffle.partitions=4000
--conf spark.executor.memory=24g
--conf spark.yarn.executor.memoryOverhead=2048
--executor-cores 1

I can give each node a container of up to 28 GB. I have a cluster of 570 nodes with 128 GB of RAM. How should I process the datasets for the join operation?

[Stage 3:>                                                   (68 + 717) / 10136]

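I don't understand why I see 717 active tasks, since I am running only one executor per node (--executor-cores 1). Can you help me understand this? Please also suggest how I should perform this join, because, if I am right, the data for each key is too large to fit on a single node. I am getting this error: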

java.nio.channels.ClosedChannelException
17/07/16 02:49:10 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
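One way to read the progress bar above, as a rough sketch rather than a diagnosis: the counter has the form (completed + running) / total, so 717 tasks are running at once. `--executor-cores` sets the number of cores per executor, not the number of executors per node, so with one core per executor 717 running tasks imply at least 717 executors across the 570 nodes. The helper names below are hypothetical, and the sketch assumes one CPU per task (the default `spark.task.cpus`).

```python
import math

def container_gb(executor_memory_gb, overhead_mb):
    """Approximate YARN container size: executor heap plus memory overhead."""
    return executor_memory_gb + overhead_mb / 1024

def min_executors(running_tasks, executor_cores, task_cpus=1):
    """Smallest number of executors that can run this many tasks at once."""
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(running_tasks / slots_per_executor)

# Values taken from the question: 24g heap, 2048 MB overhead, 1 core,
# 717 running tasks, 570 nodes.
size_gb = container_gb(24, 2048)              # fits under the 28 GB limit
executors = min_executors(717, 1)
busiest_node = math.ceil(executors / 570)     # some node hosts at least this many

print(size_gb, executors, busiest_node)
```

Under these assumptions each container needs about 26 GB, and at least one node must be hosting two or more executors, which is consistent with seeing more running tasks than nodes.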

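Try partitioning both datasets by the same column and joining them afterwards.

@AlbertoBonsanto I have already tried that, but the 675 GB dataset has only 200 unique keys, so the amount of data per partition is very large and may not fit in one executor.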
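A technique often suggested for this situation is key salting. It is not something proposed in this thread, so treat the following as an illustrative assumption: a plain-Python model of the idea, not Spark code. Each row of the larger side gets a random salt, each row of the other side is replicated once per salt value, and the join runs on (key, salt) instead of key alone. The result is the same, but each key is spread over up to `num_salts` groups. All function names here are hypothetical.

```python
import random
from collections import defaultdict

def plain_join(big, small):
    """Reference inner join on the key alone."""
    by_key = defaultdict(list)
    for key, value in small:
        by_key[key].append(value)
    return [(k, v, w) for k, v in big for w in by_key.get(k, [])]

def salt_rows(big, num_salts, seed=0):
    """Attach a random salt to every row of the larger side."""
    rng = random.Random(seed)
    return [((k, rng.randrange(num_salts)), v) for k, v in big]

def replicate_rows(small, num_salts):
    """Copy every row of the other side once per possible salt value."""
    return [((k, s), w) for k, w in small for s in range(num_salts)]

def salted_join(big, small, num_salts, seed=0):
    """Inner join on (key, salt); yields the same rows as plain_join."""
    by_group = defaultdict(list)
    for group, w in replicate_rows(small, num_salts):
        by_group[group].append(w)
    return [(group[0], v, w)
            for group, v in salt_rows(big, num_salts, seed)
            for w in by_group.get(group, [])]

# Toy data: two heavy keys on the big side.
big = [("a", i) for i in range(500)] + [("b", i) for i in range(500)]
small = [("a", "x"), ("a", "y"), ("b", "z")]

groups = {group for group, _ in salt_rows(big, num_salts=8)}
print(len(groups))  # number of distinct (key, salt) join groups
```

The trade-off is that the replicated side grows by a factor of `num_salts`, which matters here because the second dataset is also too large to broadcast. Whether the extra shuffle volume is acceptable depends on its actual size.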