PySpark DataFrame: how to fill values in a column by conditionally replacing them with a column from another DataFrame


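I have two DataFrames: a raw one (40 columns) and a transformed one (60 columns). To keep things simple, I show only 3 of the columns here.

df1_raw (40 columns)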

ID     city      State
2      Montreal  Quebec
3      Airdrie   Alberta
4      Edmonton  Alberta
5      Leduc     Alberta
6      Brandon   Manitoba
7      Winnipeg  Manitoba
9      St. John  Newfoundland

df_transformed (60 columns)

ID     city      State    
2      Montreal  Quebec
3                Alberta
4      Edmonton  Alberta
5                Alberta
6      Brandon   Manitoba
7                Manitoba
9      St. John  Newfoundland

If the 'city' column in df_transformed is null, I need to take 'city' from df1_raw, joining on 'ID'.

So the result should look like this:

 3      Airdrie   Alberta ....

If I have to use coalesce, there are a lot of columns that need to be renamed and dropped after the join. Is there a way to avoid this? Thanks.

Would this work? You only need to rename the city column once at the end.

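from pyspark.sql.functions import coalesce

# Join only ID and city from df1_raw, so no other raw columns have to be dropped afterwards.
# coalesce keeps the transformed city and falls back to the raw city when it is null.
df_transformed.join(df1_raw.select('ID', 'city'), ['ID'], "left")\
                .withColumn('new_city', coalesce(df_transformed.city, df1_raw.city))\
                .drop('city').withColumnRenamed('new_city', 'city').show()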

+---+------------+--------+
| ID|       State|    city|
+---+------------+--------+
|  2|      Quebec|Montreal|
|  3|     Alberta| Airdrie|
|  4|     Alberta|Edmonton|
|  5|     Alberta|   Leduc|
|  6|    Manitoba| Brandon|
|  7|    Manitoba|Winnipeg|
|  9|Newfoundland|St. John|
+---+------------+--------+

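Thank you very much for your solution. Since I have 40 columns in df1_raw and 60 columns in df_transformed, joining the two gives me about 100 columns, so dropping 40 of them is rather tedious. That is why I am looking for other options.

OK, how about replacing drop('State') with select('ID', 'city')? Now you don't have to drop any columns from df1_raw. (I have edited my code.)

Yes, thank you. To get around the many columns in df1_raw, I just created a df_temp with ('ID', 'city'), joined it with df_transformed and, after the coalesce, dropped the columns that were no longer needed. Thanks for your suggestion.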