PySpark: explode a list of dicts and group them by a dict key

Tags: pyspark, pyspark-sql, pyspark-dataframes

I have a DataFrame whose column contains a list of dictionaries, and I want to split out each dictionary and create one row per dictionary, based on one of its key values.

Sample data:

[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]
Input DataFrame:

+------------------------------------------------------------------------------------------------------------------------+
|ID|DATASET                                                                                                               |
+------------------------------------------------------------------------------------------------------------------------+
|4A|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]  |
|4B|[]                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------+
Expected result:

+--+-------+--------------------------------------------+
|ID|col_1  |col                                         |
+--+-------+--------------------------------------------+
|4A|"12ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}" |
|4A|"13ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}" |
|4B|""     |""                                          |
+--+-------+--------------------------------------------+
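For anyone wanting to reproduce this, a minimal sketch to build the sample input (assumes a local SparkSession; the real source may of course differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two sample rows: one holding a JSON array of dicts as a string, one empty.
rows = [
    ("4A", '[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},'
           '{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]'),
    ("4B", "[]"),
]
df = spark.createDataFrame(rows, ["ID", "DATASET"])
df.show(truncate=False)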
I tried creating a schema for the DATASET column and its individual entries, but I'm not sure how to group them and merge them based on the col.1 value:

schema = spark.read.json(df.rdd.map(lambda row: row.dataset)).schema
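For illustration, a minimal sketch of how that inferred schema could be taken the rest of the way with from_json and explode_outer. This is one possible route, not necessarily the asker's intended one; the names df, ID, and DATASET are taken from the sample above:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# Infer the schema of a single dictionary from the JSON strings
# (spark.read.json flattens top-level arrays, so this is the element schema).
elem_schema = spark.read.json(df.rdd.map(lambda row: row.DATASET)).schema

result = (
    df
    # Parse the string column as an array of structs.
    .withColumn("parsed", F.from_json("DATASET", ArrayType(elem_schema)))
    # One row per dictionary; explode_outer keeps IDs with empty arrays.
    .select("ID", F.explode_outer("parsed").alias("item"))
    # Split out col.1 and keep the remaining keys as a single JSON string.
    .select(
        "ID",
        F.col("item.`col.1`").alias("col_1"),
        F.to_json(F.struct("item.`col.2`", "item.`col.3`", "item.`col.4`")).alias("col"),
    )
)
result.show(truncate=False)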

I also referred to related questions.

Thanks in advance.

EDIT:

import pyspark.sql.functions as f

df2.withColumn("col", f.to_json(f.struct("`col.1`", "`col.2`"))).show(truncate=False)
Result:

+---+-----+-----+-----+-----+-------------------------------+
|ID |col.1|col.2|col.3|col.4|col                            |
+---+-----+-----+-----+-----+-------------------------------+
|4A |12ABC|141  |     |ABCD |{"col.1":"12ABC","col.2":"141"}|
|4A |13ABC|141  |     |ABCD |{"col.1":"13ABC","col.2":"141"}|
+---+-----+-----+-----+-----+-------------------------------+
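To land on the expected layout above (col.1 split out on its own, the remaining keys packed into one JSON string), a small variation of the same idea should work; a sketch, assuming df2 is the already-flattened frame shown in this result:

import pyspark.sql.functions as f

# Keep col.1 as its own column and pack the other keys back into JSON.
df2.select(
    "ID",
    f.col("`col.1`").alias("col_1"),
    f.to_json(f.struct("`col.2`", "`col.3`", "`col.4`")).alias("col"),
).show(truncate=False)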

I tried the approach below and it works well. I'm not sure whether it is the most optimized approach; I'm open to inputs and improvements.

Starting from the input df, we can use posexplode together with concat_ws instead of regular expressions:

from pyspark.sql import functions as F

(
    df2
    # Split the JSON string on commas and number each token.
    .select("*", F.posexplode(F.split("DATASET", ",")).alias("pos", "token"))
    # Drop the first token, which holds the col.1 key/value pair.
    .where("pos > 0")
    .groupBy("ID", "DATASET")
    # Stitch the remaining tokens back into a single string.
    .agg(F.concat_ws("_", F.collect_list("token")).alias("data_cols"))
    .select(
        "ID",
        # The first comma-separated token carries the col.1 value.
        F.split("DATASET", ",").getItem(0).alias("col_id"),
        "data_cols",
    )
    .show(truncate=False)
)
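Two caveats worth flagging on this approach: splitting DATASET on commas treats the JSON as plain text, so it breaks if any value ever contains a comma, and collect_list after a groupBy does not guarantee that tokens come back in their original order. Parsing the column with from_json and exploding the resulting array (as sketched under the question above) sidesteps both issues.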