How do I take a field with binary data, parse it as JSON, and propagate further using the JSON's columns in pyspark/kafka?
I am reading my Kafka topic with the following code:
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("PythonTest08")
spark = builder.getOrCreate()

# Subscribe to 1 topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", dataFlowTopic) \
    .load()
An example of the output looks like this:
Batch: 1
-------------------------------------------
+----+--------------------+--------+---------+-------+--------------------+-------------+
| key| value| topic|partition| offset| timestamp|timestampType|
+----+--------------------+--------+---------+-------+--------------------+-------------+
|null|[7B 22 70 6C 75 6...|dataflow| 0|3512872|2021-04-06 00:46:...| 0|
+----+--------------------+--------+---------+-------+--------------------+-------------+
where value is binary-encoded JSON.
How do I extract it, infer a schema from it, and get a new stream with its columns?
Can I map it with a Python function?
I wrote a separate script to infer the schema from a JSON example:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Infer Schema") \
    .getOrCreate()

df = spark \
    .read \
    .option("multiline", True) \
    .json("file_examples/dataflow/row01.json")
df.printSchema()
df.show()

with open("dataflow_schema.json", "w") as fp:
    fp.write(df.schema.json())
Then I tried to use it in Structured Streaming:
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

builder = SparkSession.builder \
    .appName("PythonTest10")
spark = builder.getOrCreate()

# Subscribe to 1 topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", dataFlowTopic) \
    .load()

with open("dataflow_schema.json") as fp:
    schema = StructType.fromJson(json.load(fp))

df = df \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
df.printSchema()

# Start running the query that prints the parsed rows to the console
query = df \
    .writeStream \
    .outputMode('update') \
    .format('console') \
    .start()
query.awaitTermination()
It prints the following schema:
root
|-- parsed_value: struct (nullable = true)
...
| |-- pluginVersion: string (nullable = true)
...
Unfortunately, when I publish this JSON
{"pluginVersion":"0.1",
...
}
(i.e. a single JSON object) to the Kafka stream, I get the following output:
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+
| parsed_value|
+--------------------+
|{null, null, null...|
+--------------------+
Does this mean the schema is wrong, or is it something else?

Can you show the schema file? What does the complete message look like?

The schema is fine. I found out this happens because the message is truncated to 4096 characters. But I don't know why the truncation occurs...