Python PySpark: How to aggregate a JSON column cleanly

I have nested JSON packed into a string column of a PySpark DataFrame, and I am trying to perform an UPSERT on some columns based on groupBy.

Input:

from pyspark.sql.functions import *
from pyspark.sql.types import *

input_json = """[{
    "candidate_email": "cust1@email.com",
    "transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
    "candidate_email": "cust1@email.com",
    "transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: string (nullable = true)
Transformation and current output:

output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()

# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: array (nullable = true)
#  |    |-- element: string (containsNull = true)

# Out[161]:
# [Row(candidate_email='cust1@email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
However, what changes should I make to the code above to get this output?

Desired output:

output_json = """[{
    "candidate_email": "cust1@email.com",
    "transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: string (nullable = true)
Basically, I am trying to get a clean merge into a single list instead of multiple lists.

Thanks

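Since your transactions column is of string type, we need to convert it to an array type first; then, by applying explode, we can achieve the desired result.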
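
To see what the string-to-array conversion does before running it in Spark, here is a minimal pure-Python sketch of the same three regex steps applied to one value of the transactions column. The helper name split_transactions is hypothetical, and the sketch assumes that no transaction value contains square brackets or the sequence "}," itself.

```python
import re

def split_transactions(raw):
    """Split a list-like string into its '{...}' element strings."""
    # Step 1: double the closing brace in front of each separator: "}," -> "}},"
    marked = re.sub(r"\},", "}},", raw)
    # Step 2: drop every square bracket (this removes the outer "[" and "]").
    stripped = re.sub(r"\[|\]", "", marked)
    # Step 3: split on "},"; the extra brace added in step 1 stays with the element.
    return stripped.split("},")

raw = "[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
parts = split_transactions(raw)
for part in parts:
    print(part)
```

Each returned element is still a plain string, not a parsed object, which is also true of the array that Spark's split produces.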

Example:

df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions                                                                                                    |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1@email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}]                                                        |
#+---------------+----------------------------------------------------------------------------------------------------------------+

# To build a proper array, first replace (},) with (}},), then remove the square brackets ([ and ]) and split on (},); finally, explode the resulting array.
df1=df.selectExpr("candidate_email","""explode(split(regexp_replace(regexp_replace(transactions,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as transactions""")

df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions                                          |
#+---------------+------------------------------------------------------+
#|cust1@email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1@email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1@email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+

#groupBy and then create json object
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))

df2.show(10,False)


#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions                                                                                                                                                            |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

#creating json object column in dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json                                                                                                                                                                                                                             |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

# Write the output to a JSON file (df2 is already grouped, so it can be written directly)

df2.write.mode("overwrite").json("<path>")

#content of file
#{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}

# Convert to JSON strings with toJSON (df2 is already grouped)
df2.toJSON().collect()
#[u'{"candidate_email":"cust1@email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
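
The result above holds transactions as an array of strings, while the desired output in the question keeps it as a single string containing one list. One way to get that shape is to strip the outer brackets from each collected string and join the remainders. The sketch below shows that logic in plain Python; merge_list_strings is a hypothetical helper, and it assumes every input string is either empty or wrapped in one pair of square brackets.

```python
def merge_list_strings(chunks):
    """Merge several '[...]' strings into one '[...]' string."""
    inner = []
    for chunk in chunks:
        body = chunk.strip()
        # Remove exactly one pair of outer brackets, if present.
        if body.startswith("[") and body.endswith("]"):
            body = body[1:-1].strip()
        # Skip empty lists so they do not leave stray separators.
        if body:
            inner.append(body)
    return "[" + ", ".join(inner) + "]"

collected = [
    "[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]",
    "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]",
]
merged = merge_list_strings(collected)
print(merged)
```

In Spark, the same idea could presumably be applied to the output of collect_list, either by wrapping a function like this in a UDF or by combining built-ins such as regexp_replace, concat_ws and concat; neither variant has been verified here.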