Apache Spark Kryo serialization failed: Buffer overflow. Available: 0, required: 110581
I am trying to use Spark to write data to Elasticsearch. My folder in HDFS contains about 1 GB of data split across many txt files (8,000 files), but when I submit the job I get the following error:
21/04/05 01:13:38 ERROR TaskSchedulerImpl: Lost executor 2 on mynode1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
21/04/05 01:13:53 ERROR TaskSchedulerImpl: Lost executor 0 on mynode1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
21/04/05 01:16:09 ERROR TaskSchedulerImpl: Lost executor 3 on mynode2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
21/04/05 01:16:09 ERROR TaskSchedulerImpl: Lost executor 4 on mynode1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
21/04/05 01:18:54 ERROR TaskSchedulerImpl: Lost executor 5 on mynode1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
21/04/05 01:20:00 ERROR TaskSetManager: Task 1 in stage 1.2 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.2 failed 4 times, most recent failure: Lost task 1.3 in stage 1.2 (TID 28, 192.168.5.106, executor 6): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 110581. To avoid this, increase spark.kryoserializer.buffer.max value
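The stack trace itself points at spark.kryoserializer.buffer.max. For reference, this is how that setting is usually raised in spark-defaults.conf (the 512m value here is an illustrative assumption, not a verified fix; note that the posted config already sets 256m, which is far above the ~108 KB the error reports as "required", so it is worth checking whether the setting is actually reaching the executors):

```
# spark-defaults.conf: cap up to which each task's Kryo output buffer may grow
spark.kryoserializer.buffer.max  512m
```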
My code:
import org.apache.spark.sql._
import org.elasticsearch.spark.rdd.EsSpark
import org.apache.spark.{SparkConf, SparkContext}

object index_to_es {
  def main(args: Array[String]): Unit = {
    println("Start app ...")
    // Create spark session
    val spark = SparkSession.builder()
      .master("spark://master:7077")
      .appName("Spark - Index to ES")
      .config("spark.es.nodes", "node1")
      .config("spark.es.port", "9200")
      .config("es.batch.size.entries", "1000")
      // .config("spark.es.nodes.wan.only", "true") // Needed for ES on AWS
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Write dataframe to ES
    println("Starting indexing to ES ...")
    val startTimeMillis = System.currentTimeMillis()

    // Create spark context
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    val rddFromFile = spark.sparkContext
      .wholeTextFiles("hdfs://master:9000/bigdata/bigger")
      .repartition(2)

    var listDoc = List[IndexDocument]()
    rddFromFile.collect().foreach(f => {
      val indexDoc = IndexDocument(f._1, f._2)
      listDoc = listDoc :+ indexDoc
      // val rdd = sc.makeRDD(Seq(indexDoc))
      // EsSpark.saveToEs(rdd, "bigger")
    })
    val rdd = sc.makeRDD(listDoc)
    EsSpark.saveToEs(rdd, "bigdata")

    val endTimeMillis = System.currentTimeMillis()
    val durationSeconds = (endTimeMillis - startTimeMillis) / 1000
    println("Indexing successful! Time indexing: " + durationSeconds + " seconds")
  }
}

case class IndexDocument(filePath: String, content: String)
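Context for the code above: `wholeTextFiles(...).collect()` materializes the entire 1 GB of file contents on the driver before re-parallelizing it with `makeRDD`, which forces the whole data set through the driver's Kryo buffers. A sketch of the same write kept fully distributed, mapping on the RDD and saving it directly (the index name "bigdata" and the HDFS path are taken from the code above; this is a sketch under those assumptions, not a tested fix):

```scala
// Sketch: build the documents inside the cluster instead of on the driver.
val docs = spark.sparkContext
  .wholeTextFiles("hdfs://master:9000/bigdata/bigger")
  .map { case (path, content) => IndexDocument(path, content) }

// Each partition is serialized and shipped to Elasticsearch independently,
// so no single process has to hold the whole data set.
EsSpark.saveToEs(docs, "bigdata")
```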
My Spark cluster has 3 nodes with 2 GB of RAM each. mynode1 runs Elasticsearch, and all 3 nodes run HDFS.
spark-defaults.conf
spark.master spark://master:7077
spark.eventLog.enabled false
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 6g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.storage.memoryFraction 0.2
spark.executor.memory 800m
spark.kryoserializer.buffer.max 256m
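One inconsistency worth noting in this file: spark.driver.memory is set to 6g while each node has only 2 GB of RAM, alongside 800m executors. A sketch of values sized for 2 GB nodes (the exact numbers are assumptions and depend on what else runs on each node, e.g. HDFS and, on mynode1, Elasticsearch):

```
# Assumed sizing for 2 GB nodes; leave headroom for the OS and co-located services
spark.driver.memory              1g
spark.executor.memory            1g
spark.kryoserializer.buffer.max  256m
```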
I think it works with 100 MB of data but not with 1 GB. Do I have to reduce my data?