
Apache Spark: updating a dataframe in a PySpark ETL


I have a PySpark dataframe, and I want to update the "target" dataframe from the "staging" dataframe based on a key. What is the best optimized way to do this in PySpark?

target
+---+-----------------------+------+------+
|key|updated_timestamp      |field0|field1|
+---+-----------------------+------+------+
|005|2019-10-26 21:02:30.638|cdao  |coaame|
|001|2019-10-22 13:02:30.638|aaaaaa|fsdc  |
|002|2019-12-22 11:42:30.638|stfi  |?     |
|004|2019-10-21 14:02:30.638|ct    |ome   |
|003|2019-10-24 21:02:30.638|io    |me    |
+---+-----------------------+------+------+

staging
+---+-----------------------+----------+---------+
|key|updated_timestamp      |field0    |field1   |
+---+-----------------------+----------+---------+
|006|2020-03-06 01:42:30.638|new record|xxaaame  |
|005|2019-10-29 09:42:30.638|cwwwwdao  |coaaaaame|
|004|2019-10-29 21:03:35.638|cwwwwdao  |coaaaaame|
+---+-----------------------+----------+---------+

expected output dataframe

+---+-----------------------+----------+---------+
|key|updated_timestamp      |field0    |field1   |
+---+-----------------------+----------+---------+
|005|2019-10-29 09:42:30.638|cwwwwdao  |coaaaaame|
|001|2019-10-22 13:02:30.638|aaaaaa    |fsdc     |
|002|2019-12-22 11:42:30.638|stfi      |?        |
|004|2019-10-29 21:03:35.638|cwwwwdao  |coaaaaame|
|003|2019-10-24 21:02:30.638|io        |me       |
|006|2020-03-06 01:42:30.638|new record|xxaaame  |
+---+-----------------------+----------+---------+
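
For anyone who wants to reproduce this locally, here is a minimal sketch that builds the two example dataframes; the SparkSession setup and the string-typed timestamps are assumptions made for brevity:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ["key", "updated_timestamp", "field0", "field1"]

# "target" holds the current state; "staging" holds the incoming updates
target = spark.createDataFrame([
    ("005", "2019-10-26 21:02:30.638", "cdao", "coaame"),
    ("001", "2019-10-22 13:02:30.638", "aaaaaa", "fsdc"),
    ("002", "2019-12-22 11:42:30.638", "stfi", "?"),
    ("004", "2019-10-21 14:02:30.638", "ct", "ome"),
    ("003", "2019-10-24 21:02:30.638", "io", "me"),
], columns)

staging = spark.createDataFrame([
    ("006", "2020-03-06 01:42:30.638", "new record", "xxaaame"),
    ("005", "2019-10-29 09:42:30.638", "cwwwwdao", "coaaaaame"),
    ("004", "2019-10-29 21:03:35.638", "cwwwwdao", "coaaaaame"),
], columns)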

There are several ways to achieve this. Here is one using a full outer join:

from pyspark.sql import functions as F

# Full outer join on the key, then for every column prefer the staging
# value, falling back to the target value where staging has no row.
output = staging.join(
    target,
    on='key',
    how='full'
).select(
    *(
        F.coalesce(staging[col], target[col]).alias(col)
        for col in staging.columns
    )
)
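
To sanity-check the merge against the expected output above (assuming the example dataframes from the question), you can simply sort and display it:

output.orderBy('key').show(truncate=False)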

This only works as long as the updated values are never NULL, since coalesce would otherwise fall back to the stale target value. Another solution uses union:

# Keep every staging row, plus the target rows whose key is absent from staging
output = staging.union(
    target.join(
        staging,
        on='key',
        how='left_anti'
    )
)
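
For completeness, here is one more possibility, not from the original answer but a sketch under the assumption that updated_timestamp reliably orders versions: union both frames and keep only the newest row per key with a window function. Because it replaces whole rows rather than coalescing column by column, it also tolerates NULLs in the staging columns:

from pyspark.sql import functions as F, Window

# Rank rows within each key by recency; 'yyyy-MM-dd HH:mm:ss.SSS' strings
# sort chronologically, so this works for timestamp or string columns alike.
w = Window.partitionBy('key').orderBy(F.col('updated_timestamp').desc())

output = (
    staging.unionByName(target)
    .withColumn('rn', F.row_number().over(w))
    .where(F.col('rn') == 1)
    .drop('rn')
)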

Welcome to StackOverflow! Could you add some context about what exactly needs to happen? Also, could you add the expected output dataframe?