Apache Spark: specifying multiple column conditions for a DataFrame join


How can I specify more column conditions when joining two DataFrames? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I only want the join to match when these columns match. But the syntax above is not valid, since cols only takes one string. So how do I get what I want?

One thing you can do is use raw SQL:

case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

val bar = sqlContext.createDataFrame(sc.parallelize(
    Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
    Bar(3, 1, 2, "bar") :: Nil))

val foo = sqlContext.createDataFrame(sc.parallelize(
    Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
    Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

foo.registerTempTable("foo")
bar.registerTempTable("bar")

sqlContext.sql(
    "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")
For a case like this there is also a Spark Column/expression API for the join:

Leaddetails.join(
    Utm_Master,
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)
The <=> operator in the example means "equality test that is safe for null values". The main difference from a simple equality test (===) is that the null-safe version can be used safely when one of the columns may contain null values.
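
To see the difference concretely, here is a minimal sketch (the case classes LeftRow and RightRow and their data are made up for illustration): with === the rows whose key is null do not match each other, while <=> treats two nulls as equal.

case class LeftRow(k: Option[Int], l: String)
case class RightRow(k: Option[Int], r: String)

val left = sqlContext.createDataFrame(sc.parallelize(
    LeftRow(Some(1), "a") :: LeftRow(None, "b") :: Nil))
val right = sqlContext.createDataFrame(sc.parallelize(
    RightRow(Some(1), "x") :: RightRow(None, "y") :: Nil))

// Simple equality: the rows with a null key do not match
left.join(right, left("k") === right("k")).count()   // 1

// Null-safe equality: the rows with a null key match each other
left.join(right, left("k") <=> right("k")).count()   // 2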

As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns.

Python

Leads.join(
    Utm_Master, 
    ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
    "left_outer"
)
Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess:


Scala:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)
To make the match case-insensitive, use lower(value) in the condition of the join method.

For example:

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
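
To apply the same idea inside a join condition, here is a hedged sketch (dfA and dfB are hypothetical DataFrames that both carry a "vendor" column); lower-casing both sides makes the comparison case-insensitive:

import org.apache.spark.sql.functions.lower

// dfA and dfB are hypothetical DataFrames, each with a "vendor" column;
// lower() normalizes both sides so "Fortinet" matches "fortinet".
val joined = dfA.join(dfB, lower(dfA("vendor")) === lower(dfB("vendor")), "inner")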
In PySpark, you can simply specify each condition separately:

Lead_all = Leads.join(Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource) &
    (Leaddetails.Utm_Source == Utm_Master.Utm_Source) &
    (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))

Just make sure to use the operators and the parentheses correctly.

Spark SQL supports a join on a tuple of columns when the columns are wrapped in parentheses, like

... WHERE (list_of_columns1) = (list_of_columns2)
which is shorter than specifying an equality expression (=) for each pair of columns combined by a set of ANDs.

For example:

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
   )
instead of

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE t1.a=e.a AND t1.b=e.b AND t1.c=e.c
   )

The latter is also less readable, especially when the list of columns is large and you want to handle NULLs easily.

Joining with the === option gives me duplicated columns, so I use Seq instead:

val Lead_all = Leads.join(Utm_Master,
    Seq("Utm_Source","Utm_Medium","Utm_Campaign"),"left")
Of course, this only works when the names of the joining columns are the same.
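
A quick illustration of the difference (df1 and df2 are hypothetical DataFrames that both have columns k and v):

// Expression-based join keeps both copies of the key column:
df1.join(df2, df1("k") === df2("k")).columns   // Array(k, v, k, v)

// Seq-based join (Spark 1.5+) keeps a single key column:
df1.join(df2, Seq("k")).columns                // Array(k, v, v)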

Try the following:

val rccJoin = dfRccDeuda.as("dfdeuda")
    .join(dfRccCliente.as("dfcliente"),
        col("dfdeuda.etarcid") === col("dfcliente.etarcid")
        && col("dfdeuda.etarcid") === col("dfcliente.etarcid"), "inner")

In PySpark, using parentheses around each condition is the key to using multiple column names in the join condition:

joined_df = df1.join(df2, 
    (df1['name'] == df2['name']) &
    (df1['phone'] == df2['phone'])
)

That's what I use right now. I was hoping to do it without registering temp tables. If there is no way to do it with the DataFrame API I will accept the answer. If there is, the answer by @rchukh is much better.

Could you explain the difference between === and <=>? Updated with more information about the difference between those equality tests.

Aha, couldn't find this in the docs. How did you find out? @user568109 I'm using the Java API, and there are some cases where the Column/expression API is the only option. Also, the Column/expression API is implemented mostly as a builder, so it is easier to discover new methods with each version of Spark.

This gave me duplicated columns, so I used the Seq method I added in another answer.

How can I make the join ignore the case of the values (i.e. make it case-insensitive)? I tried the following but it didn't work: sqlContext.sql("set spark.sql.caseSensitive=false"). Does it actually work? Is this supported in version 1.6?