Python — how do I convert a df column [JSON format] into multiple columns in PySpark?
I receive JSON-formatted data from Kafka and read it as a DataFrame in PySpark. After I get the data from Kafka, it shows up in DataFrame format:
DataFrame[value: string]
However, the value column contains a JSON/dict payload.
A print statement returns:
def print_row(row):
    print(row)

testing.writeStream.foreach(print_row).start()
How can I convert the value (JSON) into DataFrame columns, like:
col_1 timestamp
80.0 2020-01-13T08:58:58.164Z
You can create a DataFrame for a JSON dataset represented by an RDD[String], where each string stores one JSON object:
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
Define a schema and parse the JSON:
Since I read the Kafka data via the link below, it is returned in DataFrame format, not as JSON strings.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, IntegerType, StringType

# value schema: { "a": 1, "b": "string" }
schema = StructType().add("a", IntegerType()).add("b", StringType())
df.select(
    col("key").cast("string"),
    from_json(col("value").cast("string"), schema).alias("data"),
)