Apache Spark: specifying multiple column conditions for a DataFrame join
How can I provide more column conditions when joining two dataframes? For example, I want to run the following:
val Lead_all = Leads.join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")
I want to join only when these columns match. But the above syntax is not valid, since cols only takes one string. So how do I get what I want?

One thing you can do is to use raw SQL:
case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)
val bar = sqlContext.createDataFrame(sc.parallelize(
Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
Bar(3, 1, 2, "bar") :: Nil))
val foo = sqlContext.createDataFrame(sc.parallelize(
Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))
foo.registerTempTable("foo")
bar.registerTempTable("bar")
sqlContext.sql(
"SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")
There is a Spark Column/Expression API for this case:

Leaddetails.join(
  Utm_Master,
  Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
    && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
    && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
    && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
  "left"
)
The <=> operator in the example means "equality test that is safe for null values". The difference from simple equality (===) is that the first one is safe to use in case one of the columns may have null values.

As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to:
Python
Leads.join(
Utm_Master,
["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
"left_outer"
)
Scala
The question asked for a Scala answer, but I don't use Scala. Here is my best guess in Scala:
Leaddetails.join(
Utm_Master,
Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
&& Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
&& Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
&& Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
"left"
)
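To see what null-safe equality (<=>) does without a cluster, here is a plain-Python sketch of the semantics; this is not Spark code, just an illustration of the two comparison rules.

```python
def null_safe_eq(a, b):
    """Mimics Spark's <=> : NULL <=> NULL is true, NULL <=> x is false."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def sql_eq(a, b):
    """Mimics plain SQL = (Spark ===): any comparison involving NULL
    yields NULL (None here), which a join treats as "no match"."""
    if a is None or b is None:
        return None
    return a == b

print(null_safe_eq(None, None))  # True: two NULL keys still join
print(sql_eq(None, None))        # None: the row pair is dropped
```

So with ===, rows whose join keys are NULL on both sides are silently dropped, while <=> keeps them paired.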
To make the comparison case-insensitive, use lower(value) in the condition of the join method.

For example:

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
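The idea of lower-casing both sides before comparing can be sketched in plain Python (no Spark; the vendor/region data below is made up for illustration):

```python
left = [{"vendor": "Fortinet", "qty": 2}]
right = [{"vendor": "FORTINET", "region": "EMEA"}]

# Lower-case the key on both sides before comparing, analogous to
# wrapping both join columns in lower(...) inside the join condition.
joined = [
    {**l, **r}
    for l in left
    for r in right
    if l["vendor"].lower() == r["vendor"].lower()
]
print(joined)
```

Without the .lower() on both sides, "Fortinet" and "FORTINET" would not match and the join would produce no rows.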
In Pyspark, you can simply specify each condition separately:

Lead_all = Leads.join(Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource) &
    (Leaddetails.Utm_Source == Utm_Master.Utm_Source) &
    (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))
Just be sure to use the operators and parentheses correctly.

Spark SQL supports joins on tuples of columns when they are in parentheses, like

... WHERE (list_of_columns1) = (list_of_columns2)

which is way shorter than specifying an equality expression (=) for each pair of columns combined by a set of "AND"s.

For example:
SELECT a,b,c
FROM tab1 t1
WHERE
NOT EXISTS
( SELECT 1
FROM t1_except_t2_df e
WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
)
instead of
SELECT a,b,c
FROM tab1 t1
WHERE
NOT EXISTS
( SELECT 1
FROM t1_except_t2_df e
WHERE t1.a=e.a AND t1.b=e.b AND t1.c=e.c
)
which is also less readable, especially when the list of columns is big and you want to deal with NULLs easily.

The === option gives me duplicated columns, so I use Seq instead:
val Lead_all = Leads.join(Utm_Master,
Seq("Utm_Source","Utm_Medium","Utm_Campaign"),"left")
Of course, this only works when the names of the joining columns are the same.

Try this:
val rccJoin = dfRccDeuda.as("dfdeuda")
  .join(dfRccCliente.as("dfcliente"),
    col("dfdeuda.etarcid") === col("dfcliente.etarcid"),
    "inner")
In Pyspark, putting parentheses around each condition is the key to using multiple column names in the join condition:
joined_df = df1.join(df2,
(df1['name'] == df2['name']) &
(df1['phone'] == df2['phone'])
)
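The parentheses matter because in Python the bitwise & operator binds tighter than ==, so without them the condition groups in an unintended way. A minimal demo with plain ints:

```python
# & binds tighter than ==, so the unparenthesized expression groups as
# 2 == (2 & 3) == 3, i.e. the chained comparison 2 == 2 == 3 -> False.
without_parens = 2 == 2 & 3 == 3

# With parentheses each comparison is evaluated first: True & True -> True.
with_parens = (2 == 2) & (3 == 3)

print(without_parens, with_parens)
```

With PySpark Columns it is worse: the unparenthesized form typically raises an error or builds the wrong expression, which is why each equality must be wrapped in parentheses before combining with &.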
This is the method I use now. I was hoping to do it without registering temporary tables, though. If there is no way to do it with the DataFrame API, I will accept this answer.

If so, @rchukh's answer is much better.

Could you explain the difference between === and <=>?

Updated with more information about the difference between those equality tests.

Aha, couldn't find this in the documentation. How did you find out about it?

@user568109, I am using the Java API, and there are some cases where the Column/Expression API is the only option. Also, the Column/Expression API is implemented mostly as a Builder, so it is easier to discover new methods in each version of Spark.

This gave me duplicated columns, so I used the Seq method I added in another answer.

How can I make the join ignore the case of values (i.e. make it case-insensitive)? I tried the following, but it didn't work: sqlContext.sql("set spark.sql.caseSensitive=false") Does it actually work? Is this supported in version 1.6?