Apache Spark: problem writing to Elasticsearch from a Kafka Spark Structured Streaming source

Tags: apache-spark, elasticsearch, apache-kafka, spark-structured-streaming

I get the following error when writing to Elasticsearch from a Kafka Spark Structured Streaming source. Sample code below:
tweet_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "twitter") \
    .option("startingOffsets", "earliest") \
    .load()

df = tweet_stream.selectExpr("CAST(value AS STRING)")

df.writeStream \
    .format("es") \
    .option("es.index.auto.create", "true") \
    .option("es.port", "9200") \
    .option("es.resource.write", "tweet/_doc") \
    .option("checkpointLocation", r"C:\elk\checkpoint") \
    .option("es.nodes", "localhost") \
    .outputMode("append") \
    .start() \
    .awaitTermination()
I am running the above code on a local Windows machine.
I get no error when writing the same output to the console.
I also get no error writing to Elasticsearch when the source is a CSV file (via Spark Structured Streaming).
Versions:
Spark 2.4.7
Elasticsearch 7.12.1
Error:
21/05/16 15:24:29 ERROR MicroBatchExecution: Query [id = e968184b-f88b-4ed0-9153-283271b2936d, runId = dd790bcc-b0a4-4786-b73d-5442efae8e30] terminated with error
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"logOffset":2}
at org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.deserializeOffset(KafkaMicroBatchReader.scala:174)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5$$anonfun$apply$2$$anonfun$apply$mcV$sp$2.apply(MicroBatchExecution.scala:354)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5$$anonfun$apply$2$$anonfun$apply$mcV$sp$2.apply(MicroBatchExecution.scala:354)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5$$anonfun$apply$2.apply$mcV$sp(MicroBatchExecution.scala:354)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5$$anonfun$apply$2.apply(MicroBatchExecution.scala:353)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5$$anonfun$apply$2.apply(MicroBatchExecution.scala:353)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5.apply(MicroBatchExecution.scala:349)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$5.apply(MicroBatchExecution.scala:341)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcZ$sp(MicroBatchExecution.scala:341)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:337)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:337)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:183)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
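One observation about the error itself (an inference from the message, not something stated in the question): `{"logOffset":2}` is the offset format that Spark's file-based sources write to the checkpoint's offset log, while the Kafka source expects per-topic/partition offsets like `{"topicA":{"0":23}}`. That suggests the checkpoint directory was reused from the earlier CSV-source run. Removing the stale checkpoint, or pointing the query at a brand-new `checkpointLocation`, may resolve it. A minimal sketch (path taken from the question):

```python
import shutil

# Delete the checkpoint left behind by the earlier file-source (CSV) query so
# the Kafka query writes its own offset log from scratch; ignore_errors keeps
# this safe when the directory does not exist.
shutil.rmtree(r"C:\elk\checkpoint", ignore_errors=True)
```

Note that deleting the checkpoint discards the previous query's progress, so with `startingOffsets = "earliest"` the Kafka topic will be re-read from the beginning.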
If the goal is to get data from Kafka into Elasticsearch, I would suggest using Kafka Connect or Logstash rather than writing Spark code.
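For reference, a minimal sketch of what that could look like with the Confluent Elasticsearch sink connector for Kafka Connect. The connector name and converter choice here are assumptions based on the question (plain string values, topic `twitter`, local Elasticsearch), not a tested configuration:

```json
{
  "name": "tweets-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "twitter",
    "connection.url": "http://localhost:9200",
    "type.name": "_doc",
    "key.ignore": "true",
    "schema.ignore": "true",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}
```

This would typically be submitted to a running Kafka Connect worker via its REST API (by default on port 8083), after which the connector continuously ships records from the topic into Elasticsearch with no streaming code to maintain.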