Python — how do I convert a df column [JSON format] into multiple columns in PySpark?
I receive JSON-formatted data from Kafka and read it as a DataFrame in PySpark. After I get the data from Kafka, it shows up in DataFrame format:
DataFrame[value: string]
However, the value column contains a JSON/dict payload.
A print statement returns:
def print_row(row):
    print(row)

testing.writeStream.foreach(print_row).start()
How can I convert the value (JSON) into DataFrame columns, like:
col_1 timestamp
80.0 2020-01-13T08:58:58.164Z
You can create a DataFrame for a JSON dataset represented by an RDD[String], where each string stores one JSON object:
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
Define a schema and parse the JSON:
Since I read the Kafka data via the link below, it is returned in DataFrame format, not as JSON strings.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, IntegerType, StringType

# value schema: { "a": 1, "b": "string" }
schema = StructType().add("a", IntegerType()).add("b", StringType())
df.select(
    col("key").cast("string"),
    from_json(col("value").cast("string"), schema).alias("data"),
)