Splitting and parsing JSON in a Spark column
I have a PySpark DataFrame:

catalogid   | 1123798
catalogpath | [{"1123798":"Other, poets"},{"1112194":" Poetry for kids"}]
with the schema:
StructType(List(StructField(catalogid,StringType,true),StructField(catalogpath,StringType,true)))
I just need to extract the text values from the catalogpath column, like this:
catalogid | 1123798
catalog_desc| "Other, poets"; "Poetry for kids"
A simple udf using string functions should give you what you need:
import re
from pyspark.sql import functions as f
from pyspark.sql import types as t

def parseString(s):
    # For each "},{"-separated chunk, keep everything after the first colon,
    # then strip any remaining braces and brackets.
    return ";".join(
        re.sub(r"[{}\[\]]", "", x[x.index(":") + 1:]) for x in s.split("},{")
    )

parseUdf = f.udf(parseString, t.StringType())

df.select('catalogid', parseUdf('catalogpath').alias('catalog_desc')).show(truncate=False)
which should give you:

+---------+---------------------------------+
|catalogid|catalog_desc                     |
+---------+---------------------------------+
|1123798  |"Other, poets";" Poetry for kids"|
+---------+---------------------------------+
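For a quick sanity check outside Spark, the same splitting logic can be run as plain Python on the sample string from the question (a sketch, with the trailing bracket included):

```python
import re

def parse_string(s):
    # For each "},{"-separated chunk, keep everything after the first colon
    # and strip any remaining braces or brackets.
    return ";".join(
        re.sub(r"[{}\[\]]", "", x[x.index(":") + 1:]) for x in s.split("},{")
    )

sample = '[{"1123798":"Other, poets"},{"1112194":" Poetry for kids"}]'
print(parse_string(sample))  # "Other, poets";" Poetry for kids"
```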
I hope the answer helps.

Alternatively, you can use a proper JSON parser:
import json
from itertools import chain
from pyspark.sql.functions import udf, concat_ws

@udf("array<string>")
def parse(s):
    try:
        # Collect the values of every one-key object in the JSON array.
        return list(chain.from_iterable(x.values() for x in json.loads(s)))
    except (json.JSONDecodeError, TypeError):
        return None

df = spark.createDataFrame(
    [(1123798, """[{"1123798":"Other, poets"},{"1112194":" Poetry for kids"}]""")],
    ("catalogid", "catalogpath")
)
result = df.select("catalogid", parse("catalogpath").alias("catalog_desc"))
result.show(truncate=False)
# +---------+-------------------------------+
# |catalogid|catalog_desc                   |
# +---------+-------------------------------+
# |1123798  |[Other, poets, Poetry for kids]|
# +---------+-------------------------------+
result.withColumn("catalog_desc", concat_ws(";", "catalog_desc")).show(truncate=False)
# +---------+-----------------------------+
# |catalogid|catalog_desc                 |
# +---------+-----------------------------+
# |1123798  |Other, poets; Poetry for kids|
# +---------+-----------------------------+
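The core of this JSON-based approach can also be checked without Spark: json.loads turns the string into a list of one-key dicts, and chain.from_iterable flattens all their values into a single list (a minimal sketch of the udf body above):

```python
import json
from itertools import chain

s = '[{"1123798":"Other, poets"},{"1112194":" Poetry for kids"}]'
# json.loads gives [{"1123798": "Other, poets"}, {"1112194": " Poetry for kids"}];
# chain.from_iterable then collects every dict's values into one flat list.
values = list(chain.from_iterable(x.values() for x in json.loads(s)))
print(values)            # ['Other, poets', ' Poetry for kids']
print(";".join(values))  # Other, poets; Poetry for kids
```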