Python 3.x: Updating a nested JSON with another nested JSON using Python (PySpark)
For example, I have a full nested JSON, and I need to update it with the latest values from another nested JSON. Can someone help me? I want to achieve this in PySpark. The full JSON looks like this:
{
"email": "abctest@xxx.com",
"firstName": "name01",
"id": 6304,
"surname": "Optional",
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
{
"email": "test@xxx.com",
"firstName": "name01",
"surname": "Optional",
"id": 6304,
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1_changedData",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
The latest JSON looks like this:
{
"email": "abctest@xxx.com",
"firstName": "name01",
"id": 6304,
"surname": "Optional",
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
{
"email": "test@xxx.com",
"firstName": "name01",
"surname": "Optional",
"id": 6304,
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1_changedData",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
In the above, for id=6304 we received updates to layer01.layer02.key1 and the email address, so I need to apply these updated values to the full JSON. Please help me.

You can load the two JSON files into Spark DataFrames and then perform a left join to pick up the updates from the latest JSON data:
from pyspark.sql import functions as F

full_json_df = spark.read.json(full_json_path, multiLine=True)
latest_json_df = spark.read.json(latest_json_path, multiLine=True)

# Left-join on id, then take each column from the latest side when a match
# exists, otherwise fall back to the value from the full side
updated_df = full_json_df.alias("full").join(
    latest_json_df.alias("latest"),
    F.col("full.id") == F.col("latest.id"),
    "left"
).select(
    F.col("full.id"),
    *[
        F.when(F.col("latest.id").isNotNull(), F.col(f"latest.{c}"))
         .otherwise(F.col(f"full.{c}"))
         .alias(c)
        for c in full_json_df.columns if c != "id"
    ]
)

updated_df.show(truncate=False)
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|id |email |firstName|layer01 |surname |
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|6304|test@xxx.com|name01 |[value1, value2, value3, value4, [value1_changedData, value2], [[inner value01,], [, inner_value02]]]|Optional|
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
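For intuition, the merge semantics can also be illustrated with plain Python dictionaries. This is only a sketch with sample values modeled on the question's data, not part of the PySpark solution; note that the join above replaces each top-level column wholesale, whereas a recursive merge like this would overlay only the nested fields that actually changed:

```python
def deep_merge(full, latest):
    """Recursively overlay `latest` onto `full`, keeping any keys
    that `latest` does not mention."""
    merged = dict(full)
    for key, value in latest.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

full = {
    "id": 6304,
    "email": "abctest@xxx.com",
    "layer01": {"key1": "value1",
                "layer02": {"key1": "value1", "key2": "value2"}},
}
latest = {
    "id": 6304,
    "email": "test@xxx.com",
    "layer01": {"layer02": {"key1": "value1_changedData"}},
}

result = deep_merge(full, latest)
print(result["email"])                       # test@xxx.com
print(result["layer01"]["layer02"]["key1"])  # value1_changedData
print(result["layer01"]["layer02"]["key2"])  # value2 (kept from full)
```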
Update:
If the schema differs between the full JSON and the latest JSON, you can load the two files into the same DataFrame (merging the schemas that way) and then deduplicate based on id:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Load both files into one DataFrame so that their schemas get merged
merged_json_df = spark.read.json("/path/to/{full_json.json,latest_json.json}", multiLine=True)

# Order priority: rows from the latest file first, then from the full file
w = Window.partitionBy(F.col("id")).orderBy(
    F.when(F.input_file_name().like("%latest%"), 0).otherwise(1)
)

# Keep only the first row per id, i.e. the latest version when one exists
updated_df = merged_json_df.withColumn("rn", F.row_number().over(w)) \
    .filter("rn = 1") \
    .drop("rn")
updated_df.show(truncate=False)
Do you always want to take each id's values from the latest JSON, regardless of empty updates?

@blackishop, yes, based on id we need to update the JSON; can you help me with sample code in Python? Thanks for the response. Another problem I am facing here is that in the latest nested JSON I won't get the complete JSON, only a partial one, and if I apply the above code I get an error like (org.apache.spark.sql.AnalysisException: cannot resolve 'latest.Document' given input columns:). Can you help me with this? The latest JSON I am using looks like this: {"email": "test@xxx.com", "firstName": "name01", "surname": "Optional", "id": 6304, "layer01": {"key1": "value1", "key2": "value2", "key3": "value3", "key4": "value4", "layer02": {"key1": "value1_changedData", "key2": "value2"}}}. If a column is missing from the latest JSON, is it possible to pull that column from the full JSON? With the above code, full and latest must have the same number of columns (I mean the same schema); the code fails if the schemas don't match, but in my case the latest JSON will not have the same schema. Can you help me with this?

@Pradeep OK. Is there any date in the data that identifies which file is the latest or the full one? Or perhaps the file name could be used?
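Following up on the comment thread: when the latest JSON contains only a subset of the columns, the per-column selection has to fall back to the full side for anything missing from the latest schema (in the DataFrame version, that corresponds to guarding the select-list comprehension with a check against `latest_json_df.columns`, which avoids the `cannot resolve 'latest.Document'` AnalysisException). A minimal sketch of that rule on plain Python dicts standing in for joined rows; the `Document` column is taken from the error message above, and its value here is made up for illustration:

```python
def select_columns(full_row, latest_row, matched):
    """Per-column rule: use the latest value only when the join matched
    AND the latest side actually has that column; otherwise keep full."""
    return {
        col: (latest_row[col] if matched and col in latest_row else value)
        for col, value in full_row.items()
    }

full_row = {"id": 6304, "email": "abctest@xxx.com",
            "firstName": "name01", "Document": "doc-v1"}
latest_row = {"id": 6304, "email": "test@xxx.com"}  # partial: no Document

merged = select_columns(full_row, latest_row, matched=True)
print(merged["email"])     # test@xxx.com  (updated from latest)
print(merged["Document"])  # doc-v1        (kept from full)
```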