
How can I take a field with binary data, parse it as JSON, and propagate further with the JSON's columns in PySpark/Kafka?


I am reading my Kafka topic with the following code:

from pyspark.sql import SparkSession

# "config" and "dataFlowTopic" are defined elsewhere in the script
builder = SparkSession.builder \
   .appName("PythonTest08")

spark = builder.getOrCreate()

# Subscribe to 1 topic
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
  .option("subscribe", dataFlowTopic) \
  .load()
A sample of the output looks like this:

-------------------------------------------
Batch: 1
-------------------------------------------
+----+--------------------+--------+---------+-------+--------------------+-------------+
| key|               value|   topic|partition| offset|           timestamp|timestampType|
+----+--------------------+--------+---------+-------+--------------------+-------------+
|null|[7B 22 70 6C 75 6...|dataflow|        0|3512872|2021-04-06 00:46:...|            0|
+----+--------------------+--------+---------+-------+--------------------+-------------+
where value is binary-encoded JSON.

How can I extract it, infer a schema from it, and get a new stream with its columns?

Can I map it with a Python function?


I wrote a separate script to infer the schema from a JSON sample:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Infer Schema") \
    .getOrCreate()

df = spark \
    .read \
    .option("multiline", True) \
    .json("file_examples/dataflow/row01.json")

df.printSchema()

df.show()

with open("dataflow_schema.json", "w") as fp:
    fp.write(df.schema.json())
Then I tried to use it in Structured Streaming:

import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

builder = SparkSession.builder \
   .appName("PythonTest10")


spark = builder.getOrCreate()


# Subscribe to 1 topic
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
  .option("subscribe", dataFlowTopic) \
  .load()

with open("dataflow_schema.json") as fp:
    schema = StructType.fromJson(json.load(fp))

df = df \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

df.printSchema()

# Start running the query that prints the running counts to the console
query = df \
    .writeStream \
    .outputMode('update') \
    .format('console') \
    .start()

query.awaitTermination()
It prints the following schema:

root
 |-- parsed_value: struct (nullable = true)
...
 |    |-- pluginVersion: string (nullable = true)
...
Unfortunately, when I publish this JSON:

{"pluginVersion": "0.1"
...
}
(i.e. a single JSON object) to the Kafka stream, I get the following output:

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+
|        parsed_value|
+--------------------+
|{null, null, null...|
+--------------------+

Does this mean the schema is incorrect, or is something else wrong?

Comment: Can you show the schema file? What does the complete message look like?

Reply: The schema is fine. I found out it happens because the value is truncated to 4096, but I don't know why the truncation occurs...