Python Structured Streaming job fails on termination
I have an MQTT consumer in PySpark, but when I send data to the MQTT topic, the PySpark code fails:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Test") \
    .getOrCreate()

lines = (spark
         .readStream
         .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
         .option("topic", "my_topic")
         .load("tcp://127.0.0.1:1883"))

query = lines.writeStream.format("console").start()
lines.printSchema()
query.awaitTermination()
I send messages to MQTT without using Spark:
import sys
import logging

import paho.mqtt.client as mqtt


def on_connect(client, userdata, flags, rc):
    print('connected (%s)' % client._client_id)


def on_message(client, userdata, message):
    print('------------------------------')
    print('topic: %s' % message.topic)
    print('payload: %s' % message.payload)
    print('qos: %d' % message.qos)


def main(argv):
    broker = "127.0.0.1"
    port = 1883
    mqttc = mqtt.Client("Test1")
    print("connecting to broker ", broker)
    mqttc.connect(broker, port, 60)
    mqttc.on_connect = on_connect
    mqttc.on_message = on_message
    mqttc.subscribe("my_topic")
    # mqttc.loop_start()
    print("publishing ")
    mqttc.publish("my_topic", "{\"messageeeee\"}")
    mqttc.loop_forever()


if __name__ == '__main__':
    main(sys.argv[1:])
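As a side note, the payload published above, `"{\"messageeeee\"}"`, is not valid JSON: it is a bare string inside braces with no value, which could matter later if the stream is parsed as JSON. A minimal sketch using the standard `json` module to check this and build a well-formed payload (the `"message"` key is an illustrative choice, not something from the original code):

```python
import json

# The payload published above fails to parse: json.loads expects
# a ':' after the key "messageeeee" but finds '}' instead.
bad_payload = "{\"messageeeee\"}"
try:
    json.loads(bad_payload)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)  # False

# A well-formed payload is built from a dict with json.dumps;
# the "message" key here is only an example.
good_payload = json.dumps({"message": "hello"})
print(json.loads(good_payload)["message"])  # hello
```

Passing `good_payload` to `mqttc.publish` instead of the hand-written string would give downstream consumers a payload they can actually decode.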
This is the error message that stops the PySpark streaming job every time a new message is produced in MQTT:
Logical Plan:
org.apache.bahir.sql.streaming.mqtt.MQTTTextStreamSource@72a6d759
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.AssertionError: assertion failed: DataFrame returned by getBatch from org.apache.bahir.sql.streaming.mqtt.MQTTTextStreamSource@72a6d759 did not have isStreaming=true
Project [_1#10 AS value#13, _2#11 AS timestamp#14]
+- AnalysisBarrier
+- LocalRelation [_1#10, _2#11]
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:395)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mimi/test/InitiatorSpark.py", line 41, in <module>
query.awaitTermination()
I don't know what is going wrong with the query. Is the format of the message I send to MQTT incorrect, or what exactly is happening here?

Comments:

Did you find a solution? I am facing this issue too.

What is your PySpark version?

@RakeshRakshit: Yes, I partially solved it. Have a look: @RakeshRakshit: but the problem is that I still cannot parse the JSON string received from the MQTT queue; I get null values. I explained this in the post I shared above. I am using PySpark 2.2.1, and I submit the job as follows: `~/Downloads/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --jars lib/spark-streaming-mqtt_2.11-2.2.1.jar,lib/spark-sql-streaming-mqtt_2.11-2.2.1.jar,lib/org.eclipse.paho.client.mqttv3-1.2.0.jar InitiatorSpark.py`