Apache Spark: how do I do a left outer join in Spark SQL?
I am trying to do a left outer join in Spark (1.6.2), but it does not work. My SQL query looks like this:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016").show()
The result is:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
+--------------------+--------------------+--------------------+
I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid is not null).
I expect the second uuid column to be null for unmatched rows. How do I do a left outer join correctly?
== Additional information ==
If I do the left outer join with the DataFrame API, I get the correct result:
s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')
s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()
The result I get looks like this:
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|               type|              s_uuid|           p_uuid|        created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
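For intuition, a left outer join like the DataFrame call above can be sketched in plain Python (made-up rows and keys, no Spark required): every left row survives, and the right-side columns become None where there is no match.

```python
# Minimal emulation of a left outer join on uuid (illustrative only).
s_rows = [{"type": "tained", "uuid": "a"}, {"type": "swapper", "uuid": "b"}]
p_rows = [{"uuid": "a", "created_year": 2016}]

# Index the right side by the join key.
p_by_uuid = {r["uuid"]: r for r in p_rows}

# Keep every left row; fill right-side fields with None when unmatched.
joined = [
    {"type": s["type"], "s_uuid": s["uuid"],
     "p_uuid": p_by_uuid.get(s["uuid"], {}).get("uuid")}
    for s in s_rows
]
print(joined)
# [{'type': 'tained', 's_uuid': 'a', 'p_uuid': 'a'},
#  {'type': 'swapper', 's_uuid': 'b', 'p_uuid': None}]
```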
Thanks.

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN and it will work. For more information, please check the reference.

I do not see any problem in your code. Either LEFT JOIN or LEFT OUTER JOIN should work fine. Please check the data again and verify that the rows you showed actually match.
You can also do the Spark SQL join using:

# explicit left outer join
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
Your WHERE clause filters out the rows where p.created_year (and p.uuid) is null. The way to avoid this is to move the filter on p into the ON clause:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016").show()
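This behavior is not Spark-specific; it is standard SQL outer-join semantics, so you can reproduce it with any SQL engine. A self-contained sketch using Python's built-in sqlite3, with made-up rows (hypothetical uuids 'a' and 'b'), shows the difference between filtering the right table in WHERE versus in ON:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INT)")
cur.execute("CREATE TABLE plugin (uuid TEXT, created_year INT)")
cur.executemany("INSERT INTO symptom_type VALUES (?, ?, ?)",
                [("tained", "a", 2016), ("swapper", "b", 2016)])
cur.execute("INSERT INTO plugin VALUES ('a', 2016)")  # no plugin row for uuid 'b'

# Filtering the right table in WHERE discards the NULL-extended rows,
# so the LEFT JOIN degenerates into an inner join.
where_filtered = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p ON t.uuid = p.uuid
    WHERE t.created_year = 2016 AND p.created_year = 2016
""").fetchall()

# Filtering the right table in ON keeps unmatched left rows, with NULLs.
on_filtered = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p
      ON t.uuid = p.uuid AND p.created_year = 2016
    WHERE t.created_year = 2016
""").fetchall()

print(where_filtered)  # [('tained', 'a', 'a')]
print(on_filtered)     # [('tained', 'a', 'a'), ('swapper', 'b', None)]
```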
This is correct but inefficient, because we also want to filter on t.created_year before the join happens. So the recommendation is to use subqueries:
sqlContext.sql("select t.type, t.uuid, p.uuid
from (
SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid").show()
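The subquery form can be checked the same way; in this self-contained sqlite3 sketch (made-up rows, with an extra 2015 row to show the pre-filter working), the unmatched uuid still comes back as NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INT)")
cur.execute("CREATE TABLE plugin (uuid TEXT, created_year INT)")
cur.executemany("INSERT INTO symptom_type VALUES (?, ?, ?)",
                [("tained", "a", 2016), ("swapper", "b", 2016), ("old", "c", 2015)])
cur.execute("INSERT INTO plugin VALUES ('a', 2016)")

# Both sides are filtered before the join, so the outer-join semantics survive.
rows = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM (SELECT type, uuid FROM symptom_type WHERE created_year = 2016) t
    LEFT JOIN (SELECT uuid FROM plugin WHERE created_year = 2016) p
      ON t.uuid = p.uuid
""").fetchall()

print(rows)  # unmatched 'swapper' keeps a NULL p.uuid; the 2015 row is gone
```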
I tried it, and it still shows a result like an inner join; there is no null in the second uuid. I have added the result of using the DataFrame API. The problem for me is that the second uuid in the SQL query result is not null. It looks like it just performs an inner join.

Please check the following: there should be two equals signs, not three. The syntax differs by language, since PySpark needs == while Scala needs === (three =). Example: df1.join(df2, $"df1Key" === $"df2Key").

His code has one obvious problem: he includes a filter condition in the WHERE clause on a column from the right table of the join, which produces an inner join. This is a common beginner SQL mistake. For more information, please see the reference.