PySpark: group by, join, and convert to JSON


I have two DataFrames:

  • UserDf
    • userId
    • userName
    • Address
  • OrderDf
    • userId
    • ProductName
    • ProductDesc
    • CategoryName
    • CategoryId
    • CategoryDesc
    • Price
  • Sample data:

    UserDf

    +------+--------+-------+
    |userId|userName|Address|
    +------+--------+-------+
    |     1|    Sufi|  Reons|
    |     2|    Ragu| Random|
    +------+--------+-------+

    OrderDf

    +------+-----------+-----------+------------+----------+------------+-----+
    |userId|ProductName|ProductDesc|CategoryName|CategoryId|CategoryDesc|Price|
    +------+-----------+-----------+------------+----------+------------+-----+
    |     1|         A1|      A1Dec|           A|         1|        Adec|    5|
    |     1|         A2|      A2Dec|           A|         1|        Adec|   10|
    |     1|         B1|      A1Dec|           B|         2|        Bdec|   11|
    |     2|         B4|      A4Dec|           B|         2|        Bdec|   15|
    +------+-----------+-----------+------------+----------+------------+-----+
    
    I need to group and aggregate the order df (creating a nested schema) and join it with the user df, then create a JSON file for each record.

    For example: Json 1
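    Judging from the sample data and the output of the answers below, the desired per-user JSON document is presumably something like the following (userId ends up in the output path rather than inside the document):

    {"name":"Sufi","address":"Reons","order":[{"name":"A1","price":5,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}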


    Join the two DataFrames and use `collect_list` to collect each user's orders, then write the result as JSON output partitioned by userId. One folder is created per userId (two in this example), each containing a single JSON file. Spark cannot rename or move files, so you may need some `os` operations to rename/move the files as needed.

    import pyspark.sql.functions as F

    # wrap each order row into a nested struct: order(name, price, category(Id, name, desc))
    orderdf2 = orderdf.select('userId',
        F.struct(
            F.col('ProductName').alias('name'),
            F.col('Price').alias('price'),
            F.struct(
                F.col('CategoryId').alias('Id'),
                F.col('CategoryName').alias('name'),
                F.col('CategoryDesc').alias('desc')
            ).alias('category')
        ).alias('order')
    )

    # join first, then group per user and collect the order structs into an array;
    # partitionBy('userId') writes one folder (and one JSON file) per user
    userdf.join(
        orderdf2, 'userId'
    ).groupBy(
        'userId', 'name', 'address'
    ).agg(
        F.collect_list('order').alias('order')
    ).write.partitionBy('userId').json('result')
    
    ==> userId=1/part-00144-845806db-0700-4585-bb45-01648432abc1.c000.json <==
    {"name":"Sufi","address":"Reons","order":[{"name":"A1","price":5,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}

    ==> userId=2/part-00189-845806db-0700-4585-bb45-01648432abc1.c000.json <==
    {"name":"Ragu","address":"Random","order":[{"name":"B4","price":15,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}
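    As noted above, Spark won't rename these part files for you. A minimal sketch of the kind of os/glob post-processing meant there, assuming the files were written to a local 'result' directory and hypothetical target names of the form user_<id>.json:

    import glob
    import os
    import shutil

    # move each partition's single part file to a predictable name, e.g. result/user_1.json,
    # then drop the now-empty userId=... folder
    for part_dir in glob.glob('result/userId=*'):
        user_id = part_dir.split('=')[1]
        part_file = glob.glob(os.path.join(part_dir, 'part-*.json'))[0]
        shutil.move(part_file, os.path.join('result', f'user_{user_id}.json'))
        shutil.rmtree(part_dir)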

    Spark SQL solution:

     val df = spark.sql(""" with t1 (
     select  1 c1,   'Sufi' c2, 'Reons' c3  union all
     select  2 c1,   'Ragu' c2, 'Random' c3
      )  select   c1  userId,   c2  name,   c3 Addreshh    from t1
    """)
    
     val order_df = spark.sql(""" with t1 (
     select  1 c1,   'A1' c2, 'A1Dec' c3, 'A' c4, 1 c5,   'Adec' c6, 5 c7    union all
     select  1 c1,   'A2' c2, 'A2Dec' c3, 'A' c4, 1 c5,   'Adec' c6, 10 c7    union all
     select  1 c1,   'B1' c2, 'A1Dec' c3, 'B' c4, 2 c5,   'Bdec' c6, 11 c7    union all
     select  2 c1,   'B4' c2, 'A4Dec' c3, 'B' c4, 2 c5,   'Bdec' c6, 15 c7
      )  select   c1  userId,   c2  ProductName,   c3  ProductDesc,   c4  CategoryName,   c5  CategoryId,   c6  CategoryDesc,   c7 Price    from t1
    """)
    
    df.createOrReplaceTempView("cust")
    order_df.createOrReplaceTempView("order")
    
    val dj_src1 = spark.sql(""" select userId, collect_list(named_struct('name',ProductName,'price',Price,'category',category )) order from 
    ( select userId, ProductName, Price, named_struct('id', CategoryId,'name',CategoryName,'desc', CategoryDesc ) category from order ) temp
    group by 1  
    """)
    
    dj_src1.createOrReplaceTempView("src1")
    
    val dj2 = spark.sql(""" select a.userId, a.name, a.Addreshh, b.order 
    from cust a join 
    src1 b on
    a.userId=b.userid
    """)
    
    dj2.toJSON.show(false)
    
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |value                                                                                                                                                                                                                                                                   |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |{"userId":1,"name":"Sufi","Addreshh":"Reons","order":[{"name":"A1","price":5,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"id":2,"name":"B","desc":"Bdec"}}]}|
    |{"userId":2,"name":"Ragu","Addreshh":"Random","order":[{"name":"B4","price":15,"category":{"id":2,"name":"B","desc":"Bdec"}}]}                                                                                                                                          |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
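    The toJSON output above is only displayed; if the goal is still one JSON file per user, the same write step as in the PySpark answer applies. A sketch, assuming the "cust" and "src1" temp views have been registered in a PySpark session (the output path result_sql is made up):

    final_df = spark.sql("""
        select a.userId, a.name, a.Addreshh, b.order
        from cust a
        join src1 b on a.userId = b.userId
    """)
    final_df.write.partitionBy('userId').json('result_sql')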
    

    Comments:

      • The DataFrames were created like this:

        orderdf = spark.createDataFrame(
            [("1", "A1", "A1Dec", "A", "1", "Adec", 5),
             ("1", "A2", "A2Dec", "A", "1", "Adec", 10),
             ("1", "B1", "A1Dec", "B", "2", "Bdec", 11),
             ("2", "B4", "A4Dec", "B", "2", "Bdec", 15)],
            ["userId", "ProductName", "ProductDesc", "CategoryName", "CategoryId", "CategoryDesc", "Price"])
        userdf = spark.createDataFrame(
            [("1", "Sufi", "Reons"), ("2", "Ragu", "Random")],
            ["userId", "name", "Address"])

      • The desc column doesn't match any column in orderdf. @mck
      • desc is inside the category object, i.e. CategoryDesc.
      • How can I do a left join instead? userdf.join(orderdf, 'userId', 'left')
      • How about grouping orderdf before joining? I think that would give better performance. (A sketch of this variant follows below.)
      • Whether you group orderdf first and then join, or join first and then group (I'm doing the latter), the performance difference should not be noticeable.
      • These are just sample DataFrames. In reality I have around 30 DataFrames; each needs to be grouped and joined with the others, and in the end I need to create a single JSON file (each DataFrame has millions of records). What is the best way to do that?
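    A sketch of the group-before-join variant discussed in the comments, using a left join so that users without orders are also kept (same column names as above; this is not the answer author's code, and the output path result_left is made up):

    import pyspark.sql.functions as F

    # pre-aggregate: collect each user's orders into an array before the join
    orders_per_user = orderdf.groupBy('userId').agg(
        F.collect_list(
            F.struct(
                F.col('ProductName').alias('name'),
                F.col('Price').alias('price'),
                F.struct(
                    F.col('CategoryId').alias('Id'),
                    F.col('CategoryName').alias('name'),
                    F.col('CategoryDesc').alias('desc')
                ).alias('category')
            )
        ).alias('order')
    )

    # left join keeps users with no orders (their 'order' column will be null)
    userdf.join(orders_per_user, 'userId', 'left') \
        .write.partitionBy('userId').json('result_left')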