Spark Streaming with a JSONArray as input [Pyspark]

I receive events from EventHubs (similar to Kafka) in the form of a JSONArray. For example:
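Something like the following (the values are made up for illustration; the shape matches the read_schema defined below):

[
  {"guid": "0001", "userName": "alice", "timestamp": "2021-01-01T12:00:00.000Z"},
  {"guid": "0002", "userName": "bob", "timestamp": "2021-01-01T12:00:05.000Z"}
]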

My code looks like this:

from pyspark.sql.types import *
from pyspark.sql.functions import *
import json

# defining the schema for the JSONArray
read_schema = ArrayType(StructType([
  StructField("guid", StringType(), True),
  StructField("userName", StringType(), True),
  StructField("timestamp", TimestampType(), True)]))

# defining the raw dataframe from EventHubs
streamingInputDF = (spark
    .readStream                       
    .format("eventhubs")
    .options(**eh_conf)
    .load()
)

# defining the dataframe based on the previous one. Only need the body which contains the data
streamingBodyDF = (
  streamingInputDF
    .selectExpr("cast(Body as string) as json")
    .select(from_json("json", read_schema)
    .alias("data"))
)

streamingNewDF = (
  streamingBodyDF
    .select(explode_outer("data"))
)


query = (
  streamingNewDF 
    .writeStream
    .format("json")        
    .queryName("my_stream")     
    .outputMode("append")
    .option("path", "/FileStore/sink_test")
    .option("checkpointLocation", "/FileStore/chkpt_dir_test")
    .start()
)
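One detail worth noting about the explode step above: explode_outer produces one row per array element, and by default the output column is named col and still contains the whole struct. If flat guid/userName/timestamp columns are wanted in the sink, a variant along these lines would flatten the struct (streamingFlatDF is just an illustrative name):

streamingFlatDF = (
  streamingBodyDF
    # one row per array element; alias the exploded struct instead of the default "col"
    .select(explode_outer("data").alias("event"))
    # flatten the struct so guid, userName and timestamp become top-level columns
    .select("event.*")
)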
The stream runs fine, and the chart shows the events being plotted. After a while I stopped it and checked the sink. It reports thousands of rows but no columns, and all the JSON files in my sink are 0-byte files. I suspect something is wrong with my definition of read_schema and/or my use of the explode_outer function.

This looks very similar, but the solution did not help in my case.
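To check whether read_schema itself is the issue, one option is to run from_json over a hard-coded sample body on a static (non-streaming) DataFrame; if the parsed column comes back null, the schema or the timestamp format does not match the payload. A rough sketch, using a hypothetical sample string like the one above:

# hypothetical sample body, pasted from a single EventHubs message
sample_body = '[{"guid": "0001", "userName": "alice", "timestamp": "2021-01-01T12:00:00.000Z"}]'

test_df = spark.createDataFrame([(sample_body,)], ["json"])

parsed = test_df.select(from_json("json", read_schema).alias("data"))
parsed.show(truncate=False)                         # a null "data" column points at a schema / timestamp-format mismatch
parsed.select(explode_outer("data")).printSchema()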

Thanks for your help.
