Reading from HDFS and writing to Kafka with Scala Spark, but getting a NullPointerException

I want to read a text file from HDFS with a Spark RDD and write it to Kafka via foreach:

  def main(args: Array[String]): Unit = {

    val kafkaItemProducerConfig = {
      val p = new Properties()
      p.put("bootstrap.servers", KAFKA_SERVER)
      p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      p.put("client.id", KAFKA_ITEM_CLIENT_ID)
      p.put("acks", "all")
      p
    }
    val conf: SparkConf = new SparkConf().setAppName("esc")
    val sc: SparkContext = new SparkContext(conf)
    kafkaItemProducer = new KafkaProducer[String, String](kafkaItemProducerConfig)
    if (kafkaItemProducer == null) {
      println("kafka config is error")
      sys.exit(1)
    }
    val dataToDmp = sc.textFile("/home/********/" + args(0) +"/part*")
    dataToDmp.foreach(x => {
      if (x != null && !x.isEmpty) {
        kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))
      }
    }
    )
    kafkaItemProducer.close()
  }
I am quite sure that KAFKA_SERVER, KAFKA_ITEM_CLIENT_ID, and KAFKA_TOPIC_ITEM are correct, but the job fails with this error:

 ERROR ApplicationMaster [Driver]: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 0.0 failed 4 times, most recent failure: Lost task 12.3 in stage 0.0 (TID 18, tjtx148-5-173.58os.org, executor 1): java.lang.NullPointerException
    at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:56)
    at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:53)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
    at org.apache.spark.scheduler.Task.run(Task.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
The trace points to

    at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:56)
    at esc.HdfsWriteToKafks$$anonfun$main$1.apply(HdfsWriteToKafks.scala:53)
so there is an error at line 56, which is

kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))
and line 53 is

dataToDmp.foreach(
I checked that dataToDmp has content by running
dataToDmp.take(100).foreach(println), and the data printed correctly.


Is there anything wrong with my code?

It works now. I switched from foreach to the foreachPartition method, and I create a producer inside each partition. The code is below:

  import java.util.Properties

  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import org.apache.spark.{SparkConf, SparkContext}

  def main(args: Array[String]): Unit = {

    // The Kafka-related constants now live inside main, so nothing has to be
    // serialized from the enclosing object.
    val KAFKA_ITEM_CLIENT_ID = "a"
    val KAFKA_TOPIC_ITEM = "b"
    val KAFKA_SERVER = "host:port"

    val kafkaItemProducerConfig = {
      val p = new Properties()
      p.put("bootstrap.servers", KAFKA_SERVER)
      p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      p.put("client.id", KAFKA_ITEM_CLIENT_ID)
      p.put("acks", "all")
      p
    }
    val conf: SparkConf = new SparkConf().setAppName("esc")
    val sc: SparkContext = new SparkContext(conf)

    val dataToDmp = sc.textFile("/home/****/" + args(0) + "/part*", 5)
    dataToDmp.foreachPartition(partition => {
      // The producer is created inside foreachPartition, so it is instantiated
      // on the executor that processes this partition instead of on the driver.
      val kafkaItemProducer = new KafkaProducer[String, String](kafkaItemProducerConfig)
      partition.foreach(x => {
        kafkaItemProducer.send(new ProducerRecord[String, String](KAFKA_TOPIC_ITEM, x.toString))
        Thread.sleep(100)
      })
      kafkaItemProducer.close()
    })
  }

I would add an assertion on the Kafka producer at line 56, because you are instantiating it at the top level (executed on the master node) and then using it inside the function passed to foreach, which gets serialized and shipped to the executor nodes. I'm surprised you didn't get a serialization exception, to be honest. A common Spark workflow when making external connections per RDD is to instantiate the connection/producer only inside the function. If you run this code on a cluster, the producer may be null there. You should use dataToDmp.foreachPartition and then create a new producer for each partition!

I changed it: I use the foreachPartition method instead of foreach, and I create a producer inside each partition. Thank you @LiamClarke, @cricket_007.

@cricket_007 Thank you for your advice!

It worked, @LiamClarke. I moved the KAFKA_SERVER, KAFKA_ITEM_CLIENT_ID, and KAFKA_TOPIC_ITEM definitions into main; as you said, I got a serialization exception. Originally, KAFKA_SERVER, KAFKA_ITEM_CLIENT_ID, and KAFKA_TOPIC_ITEM were defined outside main but inside the object.
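
A side note on the design choice above, not from the original post: creating and closing a producer in every partition works, but each partition pays the full producer startup cost. A common refinement is to keep one lazily initialized producer per executor JVM and reuse it across partitions. Below is a minimal sketch; the helper object KafkaSink, its send method, and the shutdown hook are illustrative assumptions, using the same Properties-based configuration as the answer.

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical helper (illustration only): a Scala object is a per-JVM singleton,
// so each executor builds its own producer the first time KafkaSink.send is
// called there; nothing gets serialized from the driver.
object KafkaSink {

  private lazy val producer: KafkaProducer[String, String] = {
    val p = new Properties()
    p.put("bootstrap.servers", "host:port") // assumed broker address, as in the answer
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("acks", "all")
    val kp = new KafkaProducer[String, String](p)
    // Flush and close the producer when the executor JVM shuts down.
    sys.addShutdownHook(kp.close())
    kp
  }

  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

With such a helper, the job body would shrink to dataToDmp.foreachPartition(_.foreach(line => KafkaSink.send("b", line))), and each executor reuses a single producer instead of opening and closing one per partition.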