Apache Spark: simplify code on PySpark data frames and reduce join statements
I have a data frame in PySpark, as shown below:
df.show()
phones_df.show()
pc_df.show()
security_df.show()
Then I want to do a full outer join on all three data frames. I did it as follows:
full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
I get the result I want, but I would like to simplify my code:
1) I want to create phones_df, pc_df, and security_df in a better way. I am repeating the same code to build each of these data frames and want to reduce that duplication.
2) I want to reduce the join statements to a single statement.
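How can I do this? Could anyone explain?

Here is one way: use when.otherwise to map the device column to a category, then pivot on that category to get the desired output: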
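import pyspark.sql.functions as F

# Map each device to a category, then pivot the categories into count columns.
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
    F.when(df.device.isin(pc_list), 'pc').otherwise(
    F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()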
+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
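One small follow-up question: if I want the average number of devices installed per id over the last 10 days, how would I do that? df has records for the last 10 days.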
Output of security_df.show() from the question:
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Output of final_df.show() from the question:
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+