Python PySpark: How to aggregate JSON columns in a clean way
I have nested JSON packed as a string column in a PySpark DataFrame, and I am trying to perform an UPSERT on some columns based on groupBy.
Input:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_json = """[{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Transformation and current output:
output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: string (containsNull = true)
# Out[161]:
# [Row(candidate_email='cust1@email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
However, what changes should I make to the code above to get this output?
Desired output:
output_json = """[{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
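Basically, I am trying to get a clean merge into a single list instead of multiple lists.
Thanks

Since your transactions column is of string type, we need to convert it to an array type first; then, by applying explode followed by groupBy and collect_list, we can get the expected result.
Example: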
df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1@email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}] |
#+---------------+----------------------------------------------------------------------------------------------------------------+
# To build a proper array: first replace (},) with (}},), then remove the square brackets, then split on (},); finally explode the resulting array.
# A raw string keeps the backslashes intact for Spark SQL and avoids Python's invalid-escape warnings.
df1=df.selectExpr("candidate_email",r"""explode(split(regexp_replace(regexp_replace(transactions,'(\\},)','}},'),'(\\[|\\])',''),"},")) as transactions""")
df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------+
#|cust1@email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1@email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1@email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+
# groupBy and collect the exploded items into one array per candidate_email
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))
df2.show(10,False)
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#creating json object column in dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# write the output to a JSON file (df2 is already grouped, so a second groupBy is not needed)
df2.write.mode("overwrite").json("<path>")
#content of file
#{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}
# convert to JSON strings with toJSON (again using the already grouped df2)
df2.toJSON().collect()
#[u'{"candidate_email":"cust1@email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
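The answer above ends with an array of strings, while the question asks for one bracketed string per candidate_email. As a minimal sketch, the same three string steps (mark the item boundaries, strip the brackets, split) can be mirrored in plain Python with the re module and then joined back into a single string. The helper names split_transactions and merge_transactions are hypothetical, and the sketch assumes no transaction value itself contains "}," or square brackets.

```python
import re
from typing import List


def split_transactions(packed: str) -> List[str]:
    """Split a packed string such as "[{...},{...}]" into its "{...}" items."""
    # Step 1: double the closing brace at each item boundary, so that the
    # later split on "}," still leaves one "}" attached to every item.
    marked = re.sub(r"\},", "}},", packed)
    # Step 2: drop the enclosing square brackets.
    unwrapped = re.sub(r"\[|\]", "", marked)
    # Step 3: split on the boundary and trim whitespace around each item.
    return [item.strip() for item in unwrapped.split("},")]


def merge_transactions(packed_rows: List[str]) -> str:
    """Merge several packed strings into one bracketed, comma-separated string."""
    items = [item for row in packed_rows for item in split_transactions(row)]
    return "[" + ", ".join(items) + "]"


# The two rows from the question's input for cust1@email.com.
rows = [
    "[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]",
    "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]",
]
merged = merge_transactions(rows)
print(merged)
```

This only checks the string handling outside Spark. Inside Spark, one option would be to wrap a function like merge_transactions in pyspark.sql.functions.udf and apply it to the array produced by collect_list; that wiring is left out here.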