PySpark: group, join, and convert to JSON
I have two dataframes:

Userdf

- userId
- name
- Address

OrderDf

- userId
- ProductName
- ProductDesc
- CategoryName
- CategoryId
- CategoryDesc
- Price

Sample data (OrderDf):
+------+-----------+-----------+------------+----------+------------+-----+
|userId|ProductName|ProductDesc|CategoryName|CategoryId|CategoryDesc|Price|
+------+-----------+-----------+------------+----------+------------+-----+
| 1| A1| A1Dec| A| 1| Adec| 5|
| 1| A2| A2Dec| A| 1| Adec| 10|
| 1| B1| A1Dec| B| 2| Bdec| 11|
| 2| B4| A4Dec| B| 2| Bdec| 15|
+------+-----------+-----------+------------+----------+------------+-----+
I need to group and aggregate OrderDf (creating a nested schema) and join it with Userdf, then create a JSON file for each user record.

For example:

Json 1
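(The example JSON itself did not survive in the post. Judging from the accepted answer's output below, the desired shape per user is presumably something like:)

{
  "name": "Sufi",
  "address": "Reons",
  "order": [
    {"name": "A1", "price": 5, "category": {"Id": "1", "name": "A", "desc": "Adec"}},
    {"name": "A2", "price": 10, "category": {"Id": "1", "name": "A", "desc": "Adec"}},
    {"name": "B1", "price": 11, "category": {"Id": "2", "name": "B", "desc": "Bdec"}}
  ]
}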
Join the two dataframes and use collect_list to collect each user's orders. Write JSON files as the output and partition them by userId. Two folders will be created, one per userId, and each folder will contain a single JSON file. Spark cannot rename or move its output files, so you may need some os operations to rename/move them as needed (see the sketch after the output below).
import pyspark.sql.functions as F

# Nest each order row into a struct: name, price, and a category sub-struct
orderdf2 = orderdf.select(
    'userId',
    F.struct(
        F.col('ProductName').alias('name'),
        F.col('Price').alias('price'),
        F.struct(
            F.col('CategoryId').alias('Id'),
            F.col('CategoryName').alias('name'),
            F.col('CategoryDesc').alias('desc')
        ).alias('category')
    ).alias('order')
)

# Join, collect each user's order structs into an array,
# and write one JSON file per userId partition
userdf.join(
    orderdf2, 'userId'
).groupBy(
    'userId', 'name', 'address'
).agg(
    F.collect_list('order').alias('order')
).write.partitionBy('userId').json('result')
==> userId=1/part-00144-845806db-0700-4585-bb45-01648432abc1.c000.json <==
{"name":"Sufi","address":"Reons","order":[{"name":"A1","price":5,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}

==> userId=2/part-00189-845806db-0700-4585-bb45-01648432abc1.c000.json <==
{"name":"Ragu","address":"Random","order":[{"name":"B4","price":15,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}
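As noted above, Spark cannot rename or move these files itself. A minimal cleanup sketch, assuming the local 'result' directory from the snippet above and a hypothetical user_<id>.json target naming scheme (a local filesystem only; HDFS/S3 would need their own file APIs):

import glob
import os
import shutil

# Spark writes one part file inside each userId=<id> folder under 'result'.
# Pull each file out and give it a stable name.
for part_dir in glob.glob('result/userId=*'):
    user_id = part_dir.split('=', 1)[1]
    part_file = glob.glob(os.path.join(part_dir, 'part-*.json'))[0]
    shutil.move(part_file, f'user_{user_id}.json')  # user_<id>.json is an assumed name
    shutil.rmtree(part_dir)  # drop the now-empty partition folder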
Spark SQL solution:

val df = spark.sql(""" with t1 as (
select 1 c1, 'Sufi' c2, 'Reons' c3 union all
select 2 c1, 'Ragu' c2, 'Random' c3
) select c1 userId, c2 name, c3 Addreshh from t1
""")
val order_df = spark.sql(""" with t1 as (
select 1 c1, 'A1' c2, 'A1Dec' c3, 'A' c4, 1 c5, 'Adec' c6, 5 c7 union all
select 1 c1, 'A2' c2, 'A2Dec' c3, 'A' c4, 1 c5, 'Adec' c6, 10 c7 union all
select 1 c1, 'B1' c2, 'A1Dec' c3, 'B' c4, 2 c5, 'Bdec' c6, 11 c7 union all
select 2 c1, 'B4' c2, 'A4Dec' c3, 'B' c4, 2 c5, 'Bdec' c6, 15 c7
) select c1 userId, c2 ProductName, c3 ProductDesc, c4 CategoryName, c5 CategoryId, c6 CategoryDesc, c7 Price from t1
""")
df.createOrReplaceTempView("cust")
order_df.createOrReplaceTempView("order")
val dj_src1 = spark.sql(""" select userId, collect_list(named_struct('name',ProductName,'price',Price,'category',category )) order from
( select userId, ProductName, Price, named_struct('id', CategoryId,'name',CategoryName,'desc', CategoryDesc ) category from order ) temp
group by 1
""")
dj_src1.createOrReplaceTempView("src1")
val dj2 = spark.sql(""" select a.userId, a.name, a.Addreshh, b.order
from cust a join
src1 b on
a.userId=b.userid
""")
dj2.toJSON.show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"userId":1,"name":"Sufi","Addreshh":"Reons","order":[{"name":"A1","price":5,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"id":2,"name":"B","desc":"Bdec"}}]}|
|{"userId":2,"name":"Ragu","Addreshh":"Random","order":[{"name":"B4","price":15,"category":{"id":2,"name":"B","desc":"Bdec"}}]} |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
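To produce per-user files from dj2 as in the PySpark answer, the same write.partitionBy("userId").json(...) pattern applies here as well.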
Comments:

Orderdf = spark.createDataFrame([("1","A1","A1Dec","A","1","Adec",5),("1","A2","A2Dec","A","1","Adec",10),("1","B1","A1Dec","B","2","Bdec",11),("2","B4","A4Dec","B","2","Bdec",15)], ["userId","ProductName","ProductDesc","CategoryName","CategoryId","CategoryDesc","Price"])
userdf = spark.createDataFrame([("1","Sufi","Reons"),("2","Ragu","Random")], ["userId","name","Address"])

- The desc column does not match any column in orderdf @mck
- desc is inside the category object, i.e. CategoryDesc.
- How can I perform a left join?
- userdf.join(orderdf, 'userId', 'left')
- How about grouping orderdf before the join? I think that would give better performance.
- Whether you group orderdf first and then join, or join first and then group (I am doing the latter), the performance difference should not be noticeable.
- What I added is just a sample df. I actually have about 30 dfs like this; I need to group each one and join it with the others, and finally create a single JSON file (each df has millions of records). What is the best way to do that?
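Regarding the single-file requirement in the last comment, one common approach (a sketch, not from the original thread) is to coalesce to one partition before writing:

# Hypothetical 'final_df' stands for the fully joined/aggregated result of
# all 30 dataframes. coalesce(1) makes Spark emit exactly one part file,
# at the cost of funnelling the entire write through a single task, which
# can be slow with millions of records.
final_df.coalesce(1).write.json('single_result')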