How to use a join with multiple conditions in PySpark?

python apache-spark

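I am able to use a DataFrame join statement with a single `on` condition in PySpark, but it fails as soon as I try to add multiple conditions.

Code:

   summary2 = summary.join(county_prop, ["category_id", "bucket"], how="leftouter")

The code above works. However, if I add another condition to the list, such as summary.bucket == 9, it fails. Please help me fix this.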

The statement that fails:

   summary2 = summary.join(county_prop, ["category_id", (summary.bucket) == 9], how="leftouter")

The error:

   TypeError: 'Column' object is not callable

Edit:

Adding a complete working example:

   from pyspark.sql.types import StructType, StructField, StringType, IntegerType

   # The numeric columns are declared as IntegerType so that the integer values in the rows below match the schema
   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", IntegerType()), StructField("prop_count", IntegerType()), StructField("event_count", IntegerType()), StructField("accum_prop_count", IntegerType())])
   bucket_summary = sqlContext.createDataFrame([], schema)

   temp_county_prop = sqlContext.createDataFrame([("nation", "nation", 1, 222, 444, 555), ("nation", "state", 2, 222, 444, 555)], schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation", "state", 2, 121, 221, 551)], schema)

I want to join on the category_id and bucket columns, and replace the values in bucket_summary with those from county_prop:

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]

   bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")

It doesn't work. What could be wrong with these two statements?

For example:

   df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket) == 9 should not appear as a join condition.

Update:

In the join condition you can use either

a list of Column join expressions

or

a list of Column / column names
Please provide a complete reproducible example.

In your example, can't you simply apply a filter ($"bucket" == 9) before performing the join?

@mtoto, I have added the example and updated the question with more findings. It also works with bucket == 9; the only failure is when I write a combination in the cond line, such as: cond = ["bucket", bucket_summary.category_id == "state"]