Apache Spark: how do I do a left outer join in Spark SQL?
I am trying to do a left outer join in Spark (1.6.2), but it does not work. My SQL query looks like this:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016").show()
The result is:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
+--------------------+--------------------+--------------------+
I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid is not null).
I expect the second uuid column to be null for unmatched rows. How do I do a left outer join correctly?
== Additional information ==
If I do the left outer join with the DataFrame API, I get the correct result:
s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')
s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()
The result I get looks like this:
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|               type|              s_uuid|           p_uuid|        created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
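For intuition, a left outer join like the DataFrame call above can be sketched in plain Python (made-up rows and keys, no Spark required): every left row survives, and the right-side columns become None where there is no match.

```python
# Minimal emulation of a left outer join on uuid (illustrative only).
s_rows = [{"type": "tained", "uuid": "a"}, {"type": "swapper", "uuid": "b"}]
p_rows = [{"uuid": "a", "created_year": 2016}]

# Index the right side by the join key.
p_by_uuid = {r["uuid"]: r for r in p_rows}

# Keep every left row; fill right-side fields with None when unmatched.
joined = [
    {"type": s["type"], "s_uuid": s["uuid"],
     "p_uuid": p_by_uuid.get(s["uuid"], {}).get("uuid")}
    for s in s_rows
]
print(joined)
# [{'type': 'tained', 's_uuid': 'a', 'p_uuid': 'a'},
#  {'type': 'swapper', 's_uuid': 'b', 'p_uuid': None}]
```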
Thanks.

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN and it will work. For more information, please check the reference.

I do not see any problem in your code. Either LEFT JOIN or LEFT OUTER JOIN should work fine. Please check the data again and verify that the rows you showed actually match.
You can also do the Spark SQL join using:

# explicit left outer join
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
Your WHERE clause filters out the rows where p.created_year (and p.uuid) is null. The way to avoid this is to move the filter on p into the ON clause:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016").show()
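This behavior is not Spark-specific; it is standard SQL outer-join semantics, so you can reproduce it with any SQL engine. A self-contained sketch using Python's built-in sqlite3, with made-up rows (hypothetical uuids 'a' and 'b'), shows the difference between filtering the right table in WHERE versus in ON:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INT)")
cur.execute("CREATE TABLE plugin (uuid TEXT, created_year INT)")
cur.executemany("INSERT INTO symptom_type VALUES (?, ?, ?)",
                [("tained", "a", 2016), ("swapper", "b", 2016)])
cur.execute("INSERT INTO plugin VALUES ('a', 2016)")  # no plugin row for uuid 'b'

# Filtering the right table in WHERE discards the NULL-extended rows,
# so the LEFT JOIN degenerates into an inner join.
where_filtered = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p ON t.uuid = p.uuid
    WHERE t.created_year = 2016 AND p.created_year = 2016
""").fetchall()

# Filtering the right table in ON keeps unmatched left rows, with NULLs.
on_filtered = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p
      ON t.uuid = p.uuid AND p.created_year = 2016
    WHERE t.created_year = 2016
""").fetchall()

print(where_filtered)  # [('tained', 'a', 'a')]
print(on_filtered)     # [('tained', 'a', 'a'), ('swapper', 'b', None)]
```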
This is correct but inefficient, because we also want to filter on t.created_year before the join happens. So the recommendation is to use subqueries:
sqlContext.sql("select t.type, t.uuid, p.uuid
from (
SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid").show()
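The subquery form can be checked the same way; in this self-contained sqlite3 sketch (made-up rows, with an extra 2015 row to show the pre-filter working), the unmatched uuid still comes back as NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INT)")
cur.execute("CREATE TABLE plugin (uuid TEXT, created_year INT)")
cur.executemany("INSERT INTO symptom_type VALUES (?, ?, ?)",
                [("tained", "a", 2016), ("swapper", "b", 2016), ("old", "c", 2015)])
cur.execute("INSERT INTO plugin VALUES ('a', 2016)")

# Both sides are filtered before the join, so the outer-join semantics survive.
rows = cur.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM (SELECT type, uuid FROM symptom_type WHERE created_year = 2016) t
    LEFT JOIN (SELECT uuid FROM plugin WHERE created_year = 2016) p
      ON t.uuid = p.uuid
""").fetchall()

print(rows)  # unmatched 'swapper' keeps a NULL p.uuid; the 2015 row is gone
```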
I tried it, and it still shows a result like an inner join; there is no null in the second uuid. I have added the result of using the DataFrame API. The problem for me is that the second uuid in the SQL query result is not null. It looks like it just performs an inner join.

Please check the following: there should be two equals signs, not three. The syntax differs by language, since PySpark needs == while Scala needs === (three =). Example: df1.join(df2, $"df1Key" === $"df2Key").

His code has one obvious problem: he includes a filter condition in the WHERE clause on a column from the right table of the join, which produces an inner join. This is a common beginner SQL mistake. For more information, please see the reference.