Python: how do I use a join on 3 tables with conditions in PySpark? (multiple tables)


I want to take columns from two other tables and use them to update columns in one table, like this MySQL UPDATE statement:

   UPDATE bucket_summary a,geo_count b, geo_state c
   SET a.category_name=b.county_name,
   a.state_code=c.state_code
   WHERE a.category_id=b.county_geoid
   AND b.state_fips=c.state_fips
   AND a.category='county' 
How do I write this in PySpark?

  condition = [a.category_id=b.county_geoid, b.state_fips=c.state_fips, a.category='county']
  df_a = df_a.join([df_b, df_c], condition, how= left) 

This doesn't work for me.

You have to perform two separate joins. a.category == 'county' cannot be part of the join condition:

df_a.filter(df_a.category == 'county') \
    .join(df_b, df_a.category_id == df_b.county_geoid, 'leftouter') \
    .join(df_c, 'state_fips', 'leftouter')

Hope this helps.

import pyspark.sql.functions as f

########
# data
########
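# note: sc.parallelize(...).toDF(...) assumes a PySpark-shell-style setup
# where `sc` (a SparkContext with an active SparkSession) already exists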
df_a = sc.parallelize([
    [None,  None,  '123', 'country'],
    ['sc2', 'cn2', '234', 'state'],
    ['sc3', 'cn3', '456', 'country']
]).toDF(('state_code', 'category_name', 'category_id', 'category'))
df_a.show()

df_b = sc.parallelize([
    ['789','United States', 'asdf'],
    ['234','California',    'abc'],
    ['456','United Kingdom','xyz']
]).toDF(('county_geoid', 'country_name', 'state_fips'))

df_c = sc.parallelize([
    ['US','asdf'],
    ['CA','abc'],
    ['UK','xyz']
]).toDF(('state_code', 'state_fips'))
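# suffix df_c's columns with '_df_c' so its state_code / state_fips
# don't collide with the same-named columns from df_a and df_b after joining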
df_c = df_c.select(*(f.col(x).alias(x + '_df_c') for x in df_c.columns))

########
# update df_a with values from df_b & df_c
########
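# the left joins keep every row of df_a; for rows whose category != 'country'
# the compound join condition fails, so the looked-up columns come back null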
df_temp = df_a.join(df_b, [df_a.category_id == df_b.county_geoid, df_a.category=='country'], 'left').drop('county_geoid')
df_temp = df_temp.withColumn('category_name_new',
                   f.when(df_temp.country_name.isNull(), df_temp.category_name).
                   otherwise(df_temp.country_name)).drop('category_name','country_name').\
                   withColumnRenamed('category_name_new','category_name')
df_a = df_temp.join(df_c,[df_temp.state_fips == df_c.state_fips_df_c, df_temp.category=='country'], 'left').drop('state_fips_df_c','state_fips')
df_a = df_a.withColumn('state_code_new',
                   f.when(df_a.state_code_df_c.isNull(), df_a.state_code).
                   otherwise(df_a.state_code_df_c)).drop('state_code_df_c','state_code').\
                   withColumnRenamed('state_code_new','state_code')
df_a.show()
Original df_a:

+----------+-------------+-----------+--------+
|state_code|category_name|category_id|category|
+----------+-------------+-----------+--------+
|      null|         null|        123| country|
|       sc2|          cn2|        234|   state|
|       sc3|          cn3|        456| country|
+----------+-------------+-----------+--------+
Output, i.e. the final df_a:

+-----------+--------+--------------+----------+
|category_id|category| category_name|state_code|
+-----------+--------+--------------+----------+
|        234|   state|           cn2|       sc2|
|        123| country|          null|      null|
|        456| country|United Kingdom|        UK|
+-----------+--------+--------------+----------+
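A side note on the when/otherwise blocks above: f.coalesce, which returns its first non-null argument, expresses the same "take the looked-up value, else keep the old one" logic more compactly. A minimal sketch, equivalent to the update section above, starting again from the original df_a, df_b, df_c built in the data section (df_temp2 and df_final are just illustrative names):

# same update written with f.coalesce instead of when/otherwise
df_temp2 = df_a.join(df_b, [df_a.category_id == df_b.county_geoid,
                            df_a.category == 'country'], 'left') \
               .drop('county_geoid') \
               .withColumn('category_name',
                           f.coalesce(f.col('country_name'), f.col('category_name'))) \
               .drop('country_name')
df_final = df_temp2.join(df_c, [df_temp2.state_fips == df_c.state_fips_df_c,
                                df_temp2.category == 'country'], 'left') \
                   .drop('state_fips', 'state_fips_df_c') \
                   .withColumn('state_code',
                               f.coalesce(f.col('state_code_df_c'), f.col('state_code'))) \
                   .drop('state_code_df_c')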

Comments on this answer: "What about adding a select statement at the end of the last join?" No, the filter approach does not work here, because the whole df_a is needed at the end. It is better to add the county condition as an AND inside both joins, which means splitting the condition into two statements. "Yes, that is easier to understand!"
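In code, the split the comments describe is just one condition list per join, with the category check ANDed into each. A minimal sketch, reusing the original df_a, df_b, df_c from the answer above:

# one condition list per join instead of a single three-part list
cond_b  = [df_a.category_id == df_b.county_geoid, df_a.category == 'country']
df_temp = df_a.join(df_b, cond_b, 'left')

cond_c  = [df_temp.state_fips == df_c.state_fips_df_c, df_temp.category == 'country']
result  = df_temp.join(df_c, cond_c, 'left')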