SQL: Filling null values by joining DataFrames with different numbers of rows and multiple columns

sql dataframe apache-spark pyspark

I searched, and although I found similar scenarios, I could not find what I was looking for.

I have the following two DataFrames:

+---------------------------+
|   ID|       Value|   type |
+---------------------------+
|  user0|     100  |   Car  |
|  user1|     102  |   Car  |
|  user2|     109  |   Dog  |
|  user3|     103  |   NA   |
|  user4|     110  |   Dog  |
|  user5|     null |   null |
|  user6|     null |   null |
|  user7|     null |   null |
+---------------------------+

+---------------------------+
|   ID2|     Value2|  type2|
+---------------------------+
|  user5|     115  |  Cell  |
|  user6|     103  |  Cell  |
|  user7|     100  |  Fridge|
+---------------------------+

I want to join the two and get this as the result:

+---------------------------+
|   ID|       Value|   type |
+---------------------------+
|  user0|     100  |   Car  |
|  user1|     102  |   Car  |
|  user2|     109  |   Dog  |
|  user3|     103  |   NA   |
|  user4|     110  |   Dog  |
|  user5|     115  |   Cell |
|  user6|     103  |   Cell |
|  user7|     100  | Fridge |
+---------------------------+

I tried the following, but it did not give the expected result:

df_joined= df1.join(df2,(df1.id==df2.id2) &
                      (df1.value==df2.value2) &
                     (df1.type==df2.type2),
                      "left").drop('id2','value2','type2')

I only get the values from the first df. Maybe left is not the right join type, but I don't know which one I should use.

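You only need to join on ID, not on the other columns, because the other columns do not hold the same values. In your condition, df1.value and df1.type are null for user5 to user7, and a comparison with null never evaluates to true, so none of those rows find a match. To combine the other columns, use coalesce, which returns the first non-null value: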

import pyspark.sql.functions as F

df_joined = df1.join(df2, df1.ID == df2.ID2, 'left').select(
    'ID',
    F.coalesce(df1.Value, df2.Value2).alias('Value'),
    F.coalesce(df1.type, df2.type2).alias('type')
)

df_joined.show()
+-----+-----+------+
|   ID|Value|  type|
+-----+-----+------+
|user0|  100|   Car|
|user1|  102|   Car|
|user2|  109|   Dog|
|user3|  103|    NA|
|user4|  110|   Dog|
|user5|  115|  Cell|
|user6|  103|  Cell|
|user7|  100|Fridge|
+-----+-----+------+
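
To see the logic of "join on ID only, then coalesce" without a Spark session, it can be sketched in plain Python. This is only an illustration of the semantics, not how Spark executes the join: `None` stands in for SQL null, the helper names `coalesce` and `left_join_fill` are hypothetical, and the sketch assumes each ID2 appears at most once in the second table.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None


def left_join_fill(left_rows, right_rows):
    """Left-join on ID only, then fill None fields from the matching right row."""
    # Assumption: ID2 is unique, so each left row has at most one match.
    right_by_id = {r["ID2"]: r for r in right_rows}
    result = []
    for row in left_rows:
        match = right_by_id.get(row["ID"], {})  # empty dict when there is no match
        result.append({
            "ID": row["ID"],
            "Value": coalesce(row["Value"], match.get("Value2")),
            "type": coalesce(row["type"], match.get("type2")),
        })
    return result


# A few rows taken from the tables in the question.
df1_rows = [
    {"ID": "user0", "Value": 100, "type": "Car"},
    {"ID": "user3", "Value": 103, "type": "NA"},
    {"ID": "user5", "Value": None, "type": None},
]
df2_rows = [{"ID2": "user5", "Value2": 115, "type2": "Cell"}]

filled = left_join_fill(df1_rows, df2_rows)
for row in filled:
    print(row)
```

Rows without a match keep their own values, and the user5 row picks up 115 and Cell from the second table. Note that the string NA for user3 is an ordinary value, not a null, so it is left untouched.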

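You can also use union and then take the max of each column per ID. union matches columns by position, so ID2, Value2 and type2 line up with ID, Value and type, and max ignores nulls: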

from pyspark.sql import functions as F

result = df1.union(df2).groupBy("ID").agg(
    F.max("value").alias("value"),
    F.max("type").alias("type")
)

result.show()
#+-----+-----+------+
#|   ID|value|  type|
#+-----+-----+------+
#|user0|  100|   Car|
#|user1|  102|   Car|
#|user2|  109|   Dog|
#|user3|  103|    NA|
#|user4|  110|   Dog|
#|user5|  115|  Cell|
#|user6|  103|  Cell|
#|user7|  100|Fridge|
#+-----+-----+------+
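
One caveat with the union approach is worth spelling out. It relies on each ID having at most one non-null value per column. The plain-Python sketch below mimics the "stack the rows, group by ID, take the max ignoring nulls" logic; the helper names `max_ignoring_none` and `union_then_max` are hypothetical, and `None` stands in for SQL null.

```python
from collections import defaultdict


def max_ignoring_none(values):
    """Max over the non-None values, or None if there are none (like SQL MAX)."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def union_then_max(rows_a, rows_b):
    """Stack both row lists by position, group by the first field, take per-column max."""
    grouped = defaultdict(list)
    for row in rows_a + rows_b:
        grouped[row[0]].append(row)
    return {
        key: (
            max_ignoring_none([r[1] for r in rows]),
            max_ignoring_none([r[2] for r in rows]),
        )
        for key, rows in grouped.items()
    }


# Case from the question: one side is null, so max simply picks the other side.
merged = union_then_max(
    [("user4", 110, "Dog"), ("user5", None, None)],
    [("user5", 115, "Cell")],
)
print(merged)

# Made-up case where both sides are non-null: each column is maximised
# independently, so the result mixes fields from two different rows.
mixed = union_then_max([("u", 1, "b")], [("u", 2, "a")])
print(mixed)
```

In the second call the result combines 2 from one row with b from the other, which is why the coalesce join is the safer choice when both DataFrames may hold real values for the same ID.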