Apache Spark: how do I do a left outer join in Spark SQL?


I am trying to do a left outer join in Spark (1.6.2), but it does not work. My SQL query looks like this:

sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p 
ON t.uuid = p.uuid 
where t.created_year = 2016 
and p.created_year = 2016").show()

The result looks like this:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
+--------------------+--------------------+--------------------+
I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid column is never null).

I would expect the second uuid column to contain nulls. How do I perform the left outer join correctly?

== Additional information ==

If I do the left outer join with the DataFrame API, I get the correct result:

s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')

s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()
The result I get looks like this:

+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|               type|              s_uuid|           p_uuid|        created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+

Thanks,

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN and it will work. For more information, please check the Spark SQL documentation.

I do not see any problem in your code. Both "left join" and "left outer join" will work fine. Please check your data again; the rows you are showing are all matched rows.

You can also perform the Spark SQL join with the DataFrame API:

# explicit left outer join
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")

You are filtering out the null values of p.created_year (and of p.uuid) with the WHERE clause. The way to avoid this is to move the filter condition on p into the ON clause:

sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p 
ON t.uuid = p.uuid 
and p.created_year = 2016
where t.created_year = 2016").show()
This is correct but inefficient, because we also want to filter on t.created_year before the join happens. Therefore, using subqueries is recommended:

sqlContext.sql("select t.type, t.uuid, p.uuid
from (
  SELECT type, uuid FROM symptom_type WHERE created_year = 2016 
) t LEFT JOIN (
  SELECT uuid FROM plugin WHERE created_year = 2016
) p 
ON t.uuid = p.uuid").show()    

I tried it and it still shows a result like an inner join; there are no nulls in the second uuid. I added the result of using DataFrames. The second uuid not being null in the SQL query is exactly my problem; it looks like it just performs an inner join.

Please check the following: there should be two =, not three. The syntax differs by language, as PySpark requires "==" whereas Scala requires "===" (three =). Example: df1.join(df2, $"df1Key" === $"df2Key")

There is an obvious problem with his code: he includes a filter condition in the WHERE clause on a column from the right table of the join, which produces an inner join. It is a common beginner SQL mistake.
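The mistake described in these comments can be modeled in a few lines of plain Python (no Spark needed), using a dict lookup as a stand-in for the join:

```python
# Toy model of the semantics: a left outer join followed by a filter on a
# right-table column collapses into an inner join, because unmatched rows
# carry None and fail the filter.
left_rows = [("tained", "u1"), ("other", "u3")]
right_years = {"u1": 2016}  # "u3" has no matching row on the right

# Left outer join: every left row survives; unmatched ones get None.
joined = [(t, u, right_years.get(u)) for (t, u) in left_rows]

# Filtering on the right-side column afterwards drops the None rows.
filtered_after = [row for row in joined if row[2] == 2016]

print(joined)          # [('tained', 'u1', 2016), ('other', 'u3', None)]
print(filtered_after)  # [('tained', 'u1', 2016)]  -> inner-join behaviour
```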