pyspark: Spark Structured Streaming over a socket
I'm trying to read a stream of data from a socket (ps.pndsn.com) and write it to temp_table for further processing, but the problem I'm currently facing is that temp_table, created as part of the writeStream, is empty even though the stream is arriving in real time. Any help would be appreciated. Below is the code snippet:
# Create a DataFrame representing the stream of input lines
# from the connection to ps.pndsn.com:9999
streamingDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "ps.pndsn.com") \
    .option("port", 9999) \
    .load()
# Is this DF actually a streaming DF?
streamingDF.isStreaming
spark.conf.set("spark.sql.shuffle.partitions", "2")  # keep shuffles small
query = (
    streamingDF
    .writeStream
    .format("memory")
    .queryName("temp_table")  # temp_table = name of the in-memory table
    .outputMode("append")     # append = only new rows in the streaming DataFrame are written to the sink
    .start()
)
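One way to rule out connectivity or format problems is to point the stream at a local stand-in server first: the socket source expects newline-terminated UTF-8 text (the same shape `nc -lk 9999` produces). Below is a minimal sketch of such a stand-in, assuming you would temporarily change the stream's host to localhost; the payload string is a placeholder, not the real PubNub feed:

```python
import socket
import threading

def serve_lines(lines, host="127.0.0.1", port=9999):
    """Accept one connection and send each payload as a newline-terminated line."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    conn, _ = server.accept()
    for line in lines:
        # The Spark socket source splits rows on newlines, so terminate each record.
        conn.sendall((line + "\n").encode("utf-8"))
    conn.close()
    server.close()

# Run the server in a background thread so the streaming query can connect to it.
payloads = ['{"sensor_uuid": "probe-84d85b75", "radiation_level": 200}']
threading.Thread(target=serve_lines, args=(payloads,), daemon=True).start()
```

If temp_table fills up against this local server but stays empty against ps.pndsn.com, the problem is on the connection/protocol side rather than in the writeStream configuration.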
Streaming output:
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of temp_table is empty.
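Note that the payload printed above is a Python dict repr (single quotes, `None`), not strict JSON, so `json.loads` would reject it as-is. A quick sketch of pulling fields out of one such line for inspection, assuming each socket line carries one record in this repr form (inside a streaming query you would instead parse the `value` column, e.g. with `from_json`, once the feed is confirmed to be JSON):

```python
import ast

# One line in the repr form shown above (illustrative sample, not live data).
line = ("{'channel': 'pubnub-sensor-network', "
        "'message': {'ambient_temperature': '1.361', "
        "'humidity': '81.1392', 'photosensor': '758.82', "
        "'radiation_level': '200', 'sensor_uuid': 'probe-84d85b75', "
        "'timestamp': 1581332619}, "
        "'publisher': None, 'subscription': None, "
        "'timetoken': 15813326199534409, 'user_metadata': None}")

record = ast.literal_eval(line)   # safely evaluates Python literals (dicts, None, ints, strings)
message = record["message"]
print(message["sensor_uuid"])     # -> probe-84d85b75
print(float(message["humidity"])) # -> 81.1392
```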