
Co-occurrence graph in Scala / Apache Spark: RpcTimeoutException


I have a file that maps documentIds to entities, and I extract document co-occurrences from it. The entity RDD looks like this:

//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])] 
To extract the relationships between entities and their frequency per document, I use the following code:

import com.google.common.base.Charsets
import com.google.common.hash.Hashing

def hashId(str: String) = {
    Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  //flatMap at SampleGraph.scala:62
  .flatMap { case(docId, entities) =>
    val entitiesWithId = entities.map { case(name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }


// Aggregate, per entity pair, a map of documentId -> freq1 * freq2
val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
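
These edges presumably feed a GraphX Graph afterwards; a minimal sketch of that step, assuming a vertex RDD keyed by the same hashed entity ids (hypothetical, not part of the question):

// Hypothetical follow-up step (not shown in the question): build the
// co-occurrence graph from the edges, using the hashed entity name as
// the vertex id and the name itself as the vertex attribute.
import org.apache.spark.graphx.{Graph, VertexId}

val vertices: RDD[(VertexId, String)] = docEntityTupleRDD
  .flatMap { case (_, entities) => entities.map { case (name, _, _) => (hashId(name), name) } }
  .distinct()

val cooccurrenceGraph = Graph(vertices, edges)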
After a while I receive the following error:

15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My Spark configuration looks like this:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")

I already increased the executor memory and, following suggestions found online, refactored a previous reduceByKey construct into aggregateByKey. The error stays the same. Can anyone help me?
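
For reference, the earlier reduceByKey construct might have looked roughly like the sketch below; this is a hypothetical reconstruction, assuming each value was wrapped in a single-entry immutable map, not the original code:

// Hypothetical reconstruction of the earlier reduceByKey variant (not the
// code from the question): each (docId, weight) value is wrapped in a
// single-entry immutable Map and the maps are merged pairwise, which
// allocates a new map on every merge.
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

val edgesViaReduce: RDD[Edge[Map[Int, Int]]] = docRelTupleRDD
  .mapValues { case (docId, weight) => Map(docId -> weight) }
  .reduceByKey(_ ++ _)
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap) }

Switching to aggregateByKey with a mutable map, as in the code above, avoids allocating a new map per merge, but it does not reduce the amount of data shuffled.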

Per document, this code is clearly O(N^2), so the first thing to try is increasing the number of partitions proportionally to the average document size. I would be careful with spark.shuffle.spill: if the output is too large to fit in memory, the data ends up on disk anyway, just under much less favorable conditions. Also, the `if (id1 < id2)` handling is obsolete.

From the comments: I tried aggregateByKey(zero, numPartitions = 44), still the same exception. By "document size", do you mean the entities per document or the number of documents? I also looked into this; the suggestion there was to multiply the number of partitions of the parent RDD (docRelTupleRDD.partitions.size -> 13) by 1.5. Not sure whether 44 partitions is still too few. And why is the `id1 < id2` check obsolete?

Reply: Passing numPartitions to aggregateByKey is too late; you want more partitions even before the flatMap. How many partitions (and how much data) do you start with, and what is the average size of Iterable[(String, String, Int)]? The check is obsolete because it is already covered by combinations.
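
A minimal sketch of the partitioning advice above, reusing the names from the question; the partition count is only a placeholder (44 is the value tried in the comments) and needs tuning to the data and cluster:

// Sketch of the suggestion above: add partitions before the O(N^2) flatMap
// instead of only at aggregateByKey. The partition count is a placeholder.
val numPartitions = 44

val docRelTupleRDD = docEntityTupleRDD
  .repartition(numPartitions)   // more, smaller tasks for the quadratic step
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
  }

val edges = docRelTupleRDD
  .aggregateByKey(collection.mutable.Map[Int, Int](), numPartitions)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }

Leaving spark.shuffle.spill at its default, rather than disabling it as in the configuration above, follows the warning about spilling.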