Correlated subquery in Spark SQL


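I have the following two tables, and I need to use a correlated subquery to check whether values from one exist in the other.

The requirement: for each record in the orders table, check whether a matching custid exists in the customer table, then output a field named FLAG with the value Y if the custid exists and N otherwise.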
orders:

orderid | custid
12345   | XYZ
34566   | XYZ
68790   | MNP
59876   | QRS
15620   | UVW

customer:

id | custid
1  | XYZ
2  | UVW

Expected output:

orderid | custid  | FLAG
12345   | XYZ     | Y
34566   | XYZ     | Y 
68790   | MNP     | N
59876   | QRS     | N
15620   | UVW     | Y

I tried something like the following, but could not get it to work:

select 
o.orderid,
o.custid,
case when o.custid EXISTS (select 1 from customer c on c.custid = o.custid)
     then 'Y'
     else 'N'
end as flag
from orders o

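Can this be solved with a correlated scalar subquery? If not, what is the best way to implement this requirement?
Please advise.
Note: I am using Spark SQL v2.4.0.

Thanks.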
IN/EXISTS predicate subqueries can only be used in a filter in Spark.
The following works on a locally recreated copy of the data:

select orderid,
       custid,
       case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
      from orders o
      left join customer c on c.custid = o.custid)

Here is how it runs against the recreated data:

// Run in spark-shell, where `spark` and spark.implicits._ are already in scope.
def textToView(csv: String, viewName: String) = {
  spark.read
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("delimiter", "|")
    .option("header", "true")
    .csv(spark.sparkContext.parallelize(csv.split("\n")).toDS)
    .createOrReplaceTempView(viewName)
}

textToView("""id | custid
1  | XYZ
2  | UVW""", "customer")

textToView("""orderid | custid
12345   | XYZ
34566   | XYZ
68790   | MNP
59876   | QRS
15620   | UVW""", "orders")

spark.sql("""
select orderid,
       custid,
       case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
      from orders o
      left join customer c on c.custid = o.custid)""").show
This returns:

+-------+------+-----------------+
|orderid|custid|existing_customer|
+-------+------+-----------------+
|  59876|   QRS|                N|
|  12345|   XYZ|                Y|
|  34566|   XYZ|                Y|
|  68790|   MNP|                N|
|  15620|   UVW|                Y|
+-------+------+-----------------+
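
One caveat with the left join approach: if customer ever holds the same custid more than once, every matching order row is repeated once per duplicate. A minimal sketch of a duplicate-safe variant follows, written with the DataFrame API and assuming a local SparkSession. The object name FlagByDistinctJoin, the helper flagOrders, the marker column matched, and the extra customer row (3, XYZ) are all hypothetical, added only to illustrate the point; they are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object FlagByDistinctJoin {
  // Hypothetical helper: flags each order with Y/N depending on
  // whether its custid appears anywhere in the customers table.
  def flagOrders(orders: DataFrame, customers: DataFrame): DataFrame = {
    // Keep one row per custid so the join cannot multiply order rows,
    // and add a marker column that is null only when there is no match.
    val known = customers
      .select(col("custid"))
      .distinct()
      .withColumn("matched", lit(true))

    orders
      .join(known, Seq("custid"), "left")
      .select(
        col("orderid"),
        col("custid"),
        when(col("matched").isNotNull, "Y").otherwise("N").as("flag"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flag-by-distinct-join")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq(
      (12345, "XYZ"), (34566, "XYZ"), (68790, "MNP"),
      (59876, "QRS"), (15620, "UVW")
    ).toDF("orderid", "custid")

    // Assumption for illustration: XYZ appears twice in customer.
    val customers = Seq((1, "XYZ"), (2, "UVW"), (3, "XYZ")).toDF("id", "custid")

    flagOrders(orders, customers).orderBy("orderid").show()
    spark.stop()
  }
}
```

With the sample tables from the question, where each custid appears only once in customer, the plain left join and this variant should agree; the distinct step only matters once duplicates show up.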


Thanks for upvoting my answer. I have now tested that it works with a locally recreated copy of the data. If my answer was useful to you, would you accept it? I am as new here as you are, and every point counts! Thanks!

Thanks for your reply. Your solution works perfectly and I have accepted the answer. However, I would still like to know whether there is any way to do the same thing with an EXISTS clause or a correlated subquery (as I understand it)? Perhaps in the future.

You can do this in Oracle, for example. When you try it in Spark 2.4, you actually get an informative error: "IN/EXISTS predicate subqueries can only be used in a filter in Spark." Informative, though not the result you were hoping for.
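
On the original question of whether a correlated scalar subquery can do this: a possible sketch is below. It assumes Spark accepts a correlated scalar subquery in the SELECT list as long as it is aggregated and correlated by equality, which to my knowledge holds for Spark 2.x, but it has not been verified against 2.4.0 here. The object name ScalarSubqueryFlag and the value flagQuery are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.SparkSession

object ScalarSubqueryFlag {
  // The subquery is aggregated with count(*), so it yields exactly one
  // value per outer row; a count above zero means the custid exists.
  val flagQuery: String =
    """select o.orderid,
      |       o.custid,
      |       case when (select count(*)
      |                  from customer c
      |                  where c.custid = o.custid) > 0
      |            then 'Y' else 'N' end as flag
      |from orders o""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scalar-subquery-flag")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    Seq((12345, "XYZ"), (34566, "XYZ"), (68790, "MNP"), (59876, "QRS"), (15620, "UVW"))
      .toDF("orderid", "custid")
      .createOrReplaceTempView("orders")
    Seq((1, "XYZ"), (2, "UVW"))
      .toDF("id", "custid")
      .createOrReplaceTempView("customer")

    spark.sql(flagQuery).orderBy("orderid").show()
    spark.stop()
  }
}
```

Because count(*) collapses all matching customer rows into a single number, this form should also be insensitive to duplicate custid values in customer. If your Spark version rejects it, the left join from the answer remains the safer choice.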