How do I fix this reduceByKey transformation problem in my PySpark code?


apache-spark pyspark


I am a bit confused about how to get this value correctly. Here is my sample data:

col_name,Category,SegmentID,total_cnt,PercentDistribution
city,ANTIOCH,1,1,15
city,ARROYO GRANDE,1,1,15
state,CA,1,3,15
state,NZ,1,4,15

I am trying to get the output DataFrame as:

This is as far as I have gotten. I need your help.

    from pyspark.sql.types import StructType,StructField,StringType,IntegerType
    import json

    join_df=spark.read.csv("/tmp/testreduce.csv",inferSchema=True, header=True)
    jsonSchema = StructType([StructField("Name", StringType())
                           , StructField("Value", IntegerType())
                           , StructField("CatColName", StringType())
                           , StructField("CatColVal", StringType())
                        ])
    def reduceKeys(row1, row2):
            row1[0].update(row2[0])
            return row1

    res_df=join_df.rdd.map(lambda row: ("Segment " + str(row[2]), ({row[1]: row[3]},row[0],row[4])))\
        .reduceByKey(lambda x, y: reduceKeys(x, y))\
        .map(lambda row: (row[0], row[1][2],row[1][1], json.dumps(row[1][0]))).toDF(jsonSchema)

Current output of my code:

The data is not grouped correctly by segment ID and CatColName.

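The problem is that reduceByKey uses the generated string "Segment 1" as the key, and that string is identical for the city rows and the state rows, so all of them are reduced together. If you add col_name to the key in the first map, the grouping works as expected, but the result then carries a different name. That can be changed back with a regular expression: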

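    from pyspark.sql.functions import col, regexp_extract

    # test_df is the DataFrame read from the CSV (join_df in the question); col_name is appended to the key.
    res_df=test_df.rdd.map(lambda row: ("Segment " + str(row[2]) + " " + str(row[0]), ({row[1]: row[3]},row[0],row[4])))\
        .reduceByKey(lambda x, y: reduceKeys(x, y))\
        .map(lambda row: (row[0], row[1][2],row[1][1], json.dumps(row[1][0]))).toDF(jsonSchema)\
        .withColumn("name", regexp_extract(col("name"), r"(\w+\s\d+)", 1))
    res_df.show(truncate=False)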

Output:

+---------+-----+----------+----------------------------------+
|name     |Value|CatColName|CatColVal                         |
+---------+-----+----------+----------------------------------+
|Segment 1|15   |city      |{"ANTIOCH": 1, "ARROYO GRANDE": 1}|
|Segment 1|15   |state     |{"CA": 3, "NZ": 4}                |
+---------+-----+----------+----------------------------------+

The final regexp_extract is only there to restore the original name.
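To see what that pattern keeps, here is a small check with Python's own re module rather than Spark. It assumes the composite key looks like "Segment <id> <col_name>"; the separator is an assumption, and for a simple pattern like this one Python's regex engine should behave the same as the one Spark uses.

```python
import re

# Same pattern as in the regexp_extract call: word characters, one whitespace, digits.
pattern = re.compile(r"(\w+\s\d+)")

# Assumed shape of the composite keys produced by the first map step.
keys = ["Segment 1 city", "Segment 1 state"]

# Group 1 is the part that regexp_extract(..., 1) would return.
names = [pattern.search(key).group(1) for key in keys]
print(names)
```

Both keys collapse back to the same display name, which is why the extra col_name can be added to the key without showing up in the final column.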

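Does the name have to be Segment 1, or would it be possible to add an extra value to it? It has to stay that way, because after creating the DataFrame I plan to generate a JSON from it.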
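If the name has to stay exactly "Segment 1", one possible alternative is to key on a tuple of (segment name, col_name) instead of a concatenated string, so no regular expression is needed afterwards. The sketch below is plain Python, not Spark: reduce_by_key is a hypothetical stand-in for RDD.reduceByKey, and merge_counts is an illustrative name, used only to show the effect of the key choice on the sample data.

```python
import json

# Sample rows from the question:
# (col_name, Category, SegmentID, total_cnt, PercentDistribution)
rows = [
    ("city", "ANTIOCH", 1, 1, 15),
    ("city", "ARROYO GRANDE", 1, 1, 15),
    ("state", "CA", 1, 3, 15),
    ("state", "NZ", 1, 4, 15),
]

def reduce_by_key(pairs, func):
    """Hypothetical plain-Python stand-in for RDD.reduceByKey (not a Spark API)."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    return list(grouped.items())

def merge_counts(left, right):
    # Build a new dict instead of mutating the left value in place.
    return ({**left[0], **right[0]}, left[1])

# Key on (segment name, col_name) so city and state rows never share a key.
pairs = [(("Segment " + str(r[2]), r[0]), ({r[1]: r[3]}, r[4])) for r in rows]

# Reshape to (Name, Value, CatColName, CatColVal), matching jsonSchema.
result = [
    (key[0], value[1], key[1], json.dumps(value[0]))
    for key, value in reduce_by_key(pairs, merge_counts)
]
for line in result:
    print(line)
```

In PySpark the same idea would presumably be the tuple key in the first map followed by reduceByKey(merge_counts), since tuples are valid RDD keys; that part is untested here.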