
Apache Spark: simplify the code and reduce the join statements in a pyspark data frame


I have a data frame in pyspark like below:

df.show

phones_df.show

pc_df.show

security_df.show

Then I want to do a full outer join on all three data frames. I did it like below:

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)

final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show

I can get what I want, but I want to simplify my code:

1) I want to create phones_df, pc_df and security_df in a better way, because I am repeating the same code while creating these data frames; I want to reduce that.
2) I want to simplify the join statements into one statement.
How can I do that? Can anyone explain?
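On ask 2, the repeated join + coalesce pattern can be folded over a list of frames. The PySpark one-liner in the comment below is a sketch, not from the thread: joining on the column *name* 'id' makes Spark keep a single shared id column, so the manual f.coalesce disappears. The plain-Python function checks the full-outer semantics with the counts visible in the final table:

```python
# In PySpark (a sketch, assuming every frame has an 'id' column):
#
#   from functools import reduce
#   final_df = reduce(lambda left, right: left.join(right, 'id', 'full_outer'),
#                     [phones_df, pc_df, security_df])
#
# Plain-Python check of the full-outer semantics over {id: count} mappings:
def full_outer_merge(frames):
    """Merge {id: value} dicts into {id: [v1, v2, ...]}; None where absent."""
    all_ids = set().union(*frames)
    return {i: [f.get(i) for f in frames] for i in sorted(all_ids)}

phones = {1: 2, 2: 1}          # phone count per id (read off the final table)
pc = {1: 1, 3: 1}              # pc count per id
security = {1: 1, 2: 1, 3: 2}  # security count per id

final = full_outer_merge([phones, pc, security])
# final[2] == [1, None, 1] mirrors the row "|  2|     1|null|       1|"
```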

Here is one way: use when/otherwise to map the device column to a category, then pivot it into the desired output:

import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
    F.when(df.device.isin(pc_list), 'pc').otherwise(
    F.when(df.device.isin(security_list), 'security')))) \
  .groupBy('id').pivot('cat').agg(F.count('cat')).show()

+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
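The pivot above is just a count over (id, category) pairs. To check the logic without a Spark session, here is a plain-Python emulation; the device lists are hypothetical (the asker's phone_list / pc_list / security_list are not shown in the thread), and the rows are shaped to reproduce the pivoted counts in the answer:

```python
from collections import Counter

# Hypothetical device lists, made up for the demo:
phone_list = ['iphone', 'android']
pc_list = ['laptop']
security_list = ['camera', 'alarm']

def categorize(device):
    # Same cascade as the F.when(...).otherwise(F.when(...)) chain above
    if device in phone_list:
        return 'phones'
    if device in pc_list:
        return 'pc'
    if device in security_list:
        return 'security'
    return None

# (id, device) rows shaped to match the answer's pivoted counts
rows = [(1, 'iphone'), (1, 'android'), (1, 'laptop'), (1, 'camera'),
        (2, 'iphone'), (2, 'alarm'),
        (3, 'laptop'), (3, 'camera'), (3, 'alarm')]

# groupBy('id').pivot('cat').agg(F.count('cat')) == counting (id, cat) pairs
pivot = Counter((i, categorize(d)) for i, d in rows)
# pivot[(1, 'phones')] == 2; a missing pair such as (2, 'pc') counts as 0,
# where Spark would show null
```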
A small follow-up doubt here: if I want the average number of devices installed in the last 10 days for each id, how can I do that? df has records for the last 10 days.
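One reading of that follow-up (an assumption, since df's schema is not shown): count devices per (id, day), then average those daily counts per id; dividing the total by a fixed 10 instead would be the other reading. The PySpark shape with a hypothetical 'day' column is sketched in the comment; the plain-Python function checks the two-level aggregation:

```python
from collections import defaultdict

# PySpark sketch (hypothetical 'day' column):
#
#   df.groupBy('id', 'day').agg(F.count('device').alias('n')) \
#     .groupBy('id').agg(F.avg('n').alias('avg_per_day'))
#
def avg_devices_per_day(rows):
    """rows: (id, day, device) triples -> {id: mean device count per day}."""
    per_day = defaultdict(int)
    for i, day, _device in rows:
        per_day[(i, day)] += 1          # devices per (id, day)
    per_id = defaultdict(list)
    for (i, _day), n in per_day.items():
        per_id[i].append(n)             # daily counts per id
    return {i: sum(ns) / len(ns) for i, ns in per_id.items()}

sample = [(1, 'day1', 'iphone'), (1, 'day1', 'laptop'), (1, 'day2', 'camera'),
          (2, 'day1', 'alarm')]
# id 1 installs 2 devices on day1 and 1 on day2 -> average 1.5; id 2 -> 1.0
```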
security_df.show (from the question):

+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+
final_df.show (from the question):

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+