Does Apache Spark Structured Streaming have a timeout issue when reading a stream from a Kafka topic?
I implemented a Spark job that reads a stream from a Kafka topic using foreachBatch in Structured Streaming:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "mykafka.broker.io:6667")
  .option("subscribe", "test-topic")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.truststore.location", "/home/hadoop/cacerts")
  .option("kafka.ssl.truststore.password", tspass)
  .option("kafka.ssl.truststore.type", "JKS")
  .option("kafka.sasl.kerberos.service.name", "kafka")
  .option("kafka.sasl.mechanism", "GSSAPI")
  .option("groupIdPrefix", "MY_GROUP_ID")
  .load()

val streamservice = df.selectExpr("CAST(value AS STRING)")
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

var stream_df = streamservice
  .selectExpr("CAST(id AS STRING) id", "CAST(x AS INT) x")

val monitoring_stream = stream_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) { }
  }
  .start()
  .awaitTermination()
I get the following error:
ERROR MicroBatchExecution: Query [id = b1f84242-d72b-4097-97c9-ee603badc484, runId = 752b0fe4-2762-4fff-8912-f4cffdbd7bdc] terminated with error
java.lang.IllegalStateException: Partition test-0's offset was changed from 1 to 0, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
I said "by default" because the query option failOnDataLoss defaults to true. As explained in the exception message, you can set it to false to keep the streaming query running instead of failing. The option is described in the Kafka integration guide as:

Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
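As a sketch, applying this to the reader from the question would look like the following (the broker address and topic name are taken from the question; only the failOnDataLoss line is new, and the security options are omitted here for brevity):

```scala
// Same reader as in the question, with failOnDataLoss disabled so the query
// logs a warning and continues instead of failing when offsets fall out of
// range (e.g., because Kafka aged out records before they were processed).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "mykafka.broker.io:6667")
  .option("subscribe", "test-topic")
  .option("failOnDataLoss", "false") // do not fail the query on possible data loss
  .load()
```

Note that this only changes how the query reacts; any records already aged out by Kafka are still skipped, so use it when occasional gaps are acceptable.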
Have you tried testing your code while stopping the messages on the Kafka topic? What happened?

Thanks, but I have no control over the Kafka broker; it is provided by another system.

You don't need control of the broker. Start your Spark code, then stop the Kafka producer.

Yeah, the problem is that the Kafka producer is also in another system. But I had a chance to test this scenario today: the topic had no data all morning, and my Spark job just waited without timing out. When data came back, it picked all of it up. That's exactly what I want. :-) Thanks, man.

Yes, I tested both cases and it works exactly as you described.