PySpark: explode a list of dicts and group rows by one of the dict keys
I have a dataframe containing a list of dictionaries, and I want to split each dictionary out into its own row, keyed on one of its values. Sample data:
[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]
Input dataframe:
+---+----------------------------------------------------------------------------------------------------------------------+
|ID |DATASET                                                                                                               |
+---+----------------------------------------------------------------------------------------------------------------------+
|4A |[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]|
|4B |[]                                                                                                                    |
+---+----------------------------------------------------------------------------------------------------------------------+
Expected result:
+---+-------+-------------------------------------------+
|ID |col_1  |col_2                                      |
+---+-------+-------------------------------------------+
|4A |"12ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
|4A |"13ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
|4B |""     |""                                         |
+---+-------+-------------------------------------------+
I tried creating a schema for the DATASET column and pulling the data out separately, but I'm not sure how to group the entries and merge them based on the col.1 value:
schema = spark.read.json(df.rdd.map(lambda row: row.DATASET)).schema
Thanks in advance.
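A minimal sketch of one way to take this the rest of the way (assumptions: Spark 3.1+ for map_filter with a Python lambda; DATASET is parsed as an array of string-to-string maps rather than the inferred struct schema; the col_1/col_2 names and empty-string fallbacks mirror the expected output above):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

# Parse the JSON-array string into an array of string->string maps.
dict_array = F.from_json("DATASET", ArrayType(MapType(StringType(), StringType())))

# explode_outer keeps IDs whose DATASET is empty (e.g. 4B) as a null row.
exploded = df.withColumn("item", F.explode_outer(dict_array))

exploded.select(
    "ID",
    F.coalesce(F.col("item").getItem("col.1"), F.lit("")).alias("col_1"),
    # Re-serialise everything except "col.1" back into a JSON string.
    F.coalesce(
        F.to_json(F.map_filter("item", lambda k, v: k != "col.1")),
        F.lit(""),
    ).alias("col_2"),
).show(truncate=False)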
EDIT:
import pyspark.sql.functions as F

# Backticks are required because the column names contain dots.
df2.withColumn("col", F.to_json(F.struct("`col.1`", "`col.2`"))).show(truncate=False)
Result:
+---+-----+-----+-----+-----+-------------------------------+
|ID |col.1|col.2|col.3|col.4|col |
+---+-----+-----+-----+-----+-------------------------------+
|4A |12ABC|141 | |ABCD |{"col.1":"12ABC","col.2":"141"}|
|4A |13ABC|141 | |ABCD |{"col.1":"13ABC","col.2":"141"}|
+---+-----+-----+-----+-----+-------------------------------+
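Building on that edit, the remaining keys could be gathered dynamically rather than listed by hand; a sketch, assuming df2 is the flattened frame shown above:

import pyspark.sql.functions as F

# Every flattened column except ID and the grouping key "col.1";
# backticks are needed because the names contain dots.
rest = [f"`{c}`" for c in df2.columns if c not in ("ID", "col.1")]

df2.select(
    "ID",
    F.col("`col.1`").alias("col_1"),
    F.to_json(F.struct(*rest)).alias("col_2"),
).show(truncate=False)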
I tried the approach below and it works fine. I'm not sure whether it is the most optimized way, so I'm open to input and improvements.
Instead of regular expressions, we can use posexplode and concat_ws:
import pyspark.sql.functions as F

(
    df2.select(
        "*",
        # Split the raw DATASET string on commas and number each fragment.
        F.posexplode(F.split("DATASET", ",")).alias("pos", "token")
    )
    # Keep every fragment after the first one (which holds the col.1 value).
    .where("pos > 0")
    .groupBy("ID", "DATASET")
    # Glue the remaining fragments back together.
    .agg(F.concat_ws("_", F.collect_list("token")).alias("data_cols"))
    .select(
        "ID",
        # The first comma-separated fragment carries the col.1 value.
        F.split("DATASET", ",").getItem(0).alias("col_id"),
        "data_cols",
    )
    .show(truncate=False)
)
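One caveat on this approach: splitting DATASET on "," assumes no commas ever appear inside the dictionary values. Parsing the string with from_json, as sketched earlier, avoids that assumption at the cost of declaring a schema up front.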