Spark Streaming with a JSONArray as input [Pyspark]
I am receiving events from EventHubs (similar to Kafka) in the form of a JSONArray, for example (hypothetical values, consistent with the schema in my code below):
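[
    {"guid": "1a2b3c4d-0000-0000-0000-000000000001", "userName": "alice", "timestamp": "2020-01-01T12:00:00.000Z"},
    {"guid": "1a2b3c4d-0000-0000-0000-000000000002", "userName": "bob", "timestamp": "2020-01-01T12:00:01.000Z"}
]

My code looks like this: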
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json

# defining the schema for the JSONArray
read_schema = ArrayType(StructType([
    StructField("guid", StringType(), True),
    StructField("userName", StringType(), True),
    StructField("timestamp", TimestampType(), True)]))

# defining the raw dataframe from EventHubs
streamingInputDF = (spark
    .readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load()
)

# defining the dataframe based on the previous one; only the body is needed, which contains the data
streamingBodyDF = (
    streamingInputDF
    .selectExpr("cast(Body as string) as json")
    .select(from_json("json", read_schema).alias("data"))
)

# one output row per element of the parsed array (a single null row if parsing returned null)
streamingNewDF = (
    streamingBodyDF
    .select(explode_outer("data"))
)

# writing the exploded rows to a JSON file sink with checkpointing
query = (
    streamingNewDF
    .writeStream
    .format("json")
    .queryName("my_stream")
    .outputMode("append")
    .option("path", "/FileStore/sink_test")
    .option("checkpointLocation", "/FileStore/chkpt_dir_test")
    .start()
)
The stream runs fine, and the dashboard chart shows events coming in.
After a while I stopped it and checked the sink. It reports thousands of rows, but no columns: all the JSON files in my sink are 0-byte files.
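For reference, a minimal way to read the sink back for inspection (same path as in the query above; note that Spark may fail to infer a schema if every file is empty):

# read the sink directory back as a batch DataFrame to inspect what was written
sink_df = spark.read.json("/FileStore/sink_test")
sink_df.printSchema()
print(sink_df.count())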
I suspect there is a problem with my read_schema definition and/or with how I am using the explode_outer function.
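One way to sanity-check this outside the stream is to run the same parsing logic on a literal sample body in a batch DataFrame (the sample values below are hypothetical):

# non-streaming sanity check of read_schema and explode_outer
# (hypothetical sample body; real payloads come from EventHubs)
sample = '[{"guid": "abc-123", "userName": "alice", "timestamp": "2020-01-01T12:00:00.000Z"}]'
test_df = spark.createDataFrame([(sample,)], ["json"])
parsed = test_df.select(from_json("json", read_schema).alias("data"))
parsed.show(truncate=False)                   # a null "data" column means the schema does not match the payload
parsed.select(explode_outer("data")).show()   # one row per array element if parsing succeeded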
This looks very similar to another question I found, but the solution there did not help in my case.
Thanks for your help!