
Co-occurrence graph in Scala / Apache Spark: RpcTimeoutException


I have a file that maps documentIds to entities, and I extract document co-occurrences from it. The entity RDD looks like this:

//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])] 
To extract the relationships between entities and their frequency per document, I use the following code:

import com.google.common.base.Charsets
import com.google.common.hash.Hashing

def hashId(str: String) = {
    Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  //flatMap at SampleGraph.scala:62
  .flatMap { case(docId, entities) =>
    val entitiesWithId = entities.map { case(name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }


// Aggregate, per entity pair, a map of documentId -> freq1 * freq2
val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
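
These edges presumably feed a GraphX Graph afterwards; a minimal sketch of that step, assuming a vertex RDD keyed by the same hashed entity ids (hypothetical, not part of the question):

// Hypothetical follow-up step (not shown in the question): build the
// co-occurrence graph from the edges, using the hashed entity name as
// the vertex id and the name itself as the vertex attribute.
import org.apache.spark.graphx.{Graph, VertexId}

val vertices: RDD[(VertexId, String)] = docEntityTupleRDD
  .flatMap { case (_, entities) => entities.map { case (name, _, _) => (hashId(name), name) } }
  .distinct()

val cooccurrenceGraph = Graph(vertices, edges)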
After a while I receive the following error:

15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My Spark configuration looks like this:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")

I already increased the executor memory and, following suggestions found online, refactored a previous reduceByKey construct into aggregateByKey. The error stays the same. Can anyone help me?
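
For reference, the earlier reduceByKey construct might have looked roughly like the sketch below; this is a hypothetical reconstruction, assuming each value was wrapped in a single-entry immutable map, not the original code:

// Hypothetical reconstruction of the earlier reduceByKey variant (not the
// code from the question): each (docId, weight) value is wrapped in a
// single-entry immutable Map and the maps are merged pairwise, which
// allocates a new map on every merge.
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

val edgesViaReduce: RDD[Edge[Map[Int, Int]]] = docRelTupleRDD
  .mapValues { case (docId, weight) => Map(docId -> weight) }
  .reduceByKey(_ ++ _)
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap) }

Switching to aggregateByKey with a mutable map, as in the code above, avoids allocating a new map per merge, but it does not reduce the amount of data shuffled.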

Per document, this code is clearly O(N^2), so the first thing to try is increasing the number of partitions proportionally to the average document size. I would be careful with spark.shuffle.spill: if the output is too large to fit in memory, the data ends up on disk anyway, just under much less favorable conditions. Also, the `if (id1 < id2)` handling is obsolete.

From the comments: I tried aggregateByKey(zero, numPartitions = 44), still the same exception. By "document size", do you mean the entities per document or the number of documents? I also looked into this; the suggestion there was to multiply the number of partitions of the parent RDD (docRelTupleRDD.partitions.size -> 13) by 1.5. Not sure whether 44 partitions is still too few. And why is the `id1 < id2` check obsolete?

Reply: Passing numPartitions to aggregateByKey is too late; you want more partitions even before the flatMap. How many partitions (and how much data) do you start with, and what is the average size of Iterable[(String, String, Int)]? The check is obsolete because it is already covered by combinations.
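
A minimal sketch of the partitioning advice above, reusing the names from the question; the partition count is only a placeholder (44 is the value tried in the comments) and needs tuning to the data and cluster:

// Sketch of the suggestion above: add partitions before the O(N^2) flatMap
// instead of only at aggregateByKey. The partition count is a placeholder.
val numPartitions = 44

val docRelTupleRDD = docEntityTupleRDD
  .repartition(numPartitions)   // more, smaller tasks for the quadratic step
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
  }

val edges = docRelTupleRDD
  .aggregateByKey(collection.mutable.Map[Int, Int](), numPartitions)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }

Leaving spark.shuffle.spill at its default, rather than disabling it as in the configuration above, follows the warning about spilling.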