Can Scala Spark SQL push down a limit operator into an inner join?
Spark SQL does not seem to be able to push a limit operator down below an inner join. This is a problem when joining large tables to extract a small number of rows. I am testing on Spark 2.2.1 (the latest release at the time of writing).

Below is a contrived example that runs in the spark-shell (Scala).

First, set up the tables:
case class Customer(id: Long, name: String, email: String, zip: String)
case class Order(id: Long, customer: Long, date: String, amount: Long)
val customers = Seq(
  Customer(0, "George Washington", "gwashington@usa.gov", "22121"),
  Customer(1, "John Adams", "gwashington@usa.gov", "02169"),
  Customer(2, "Thomas Jefferson", "gwashington@usa.gov", "22902"),
  Customer(3, "James Madison", "gwashington@usa.gov", "22960"),
  Customer(4, "James Monroe", "gwashington@usa.gov", "22902")
)
val orders = Seq(
  Order(1, 1, "07/04/1776", 23456),
  Order(2, 3, "03/14/1760", 7850),
  Order(3, 2, "05/23/1784", 12400),
  Order(4, 3, "09/03/1790", 6550),
  Order(5, 4, "07/21/1795", 2550),
  Order(6, 0, "11/27/1787", 1440)
)
import spark.implicits._
val customerTable = spark.sparkContext.parallelize(customers).toDS()
customerTable.createOrReplaceTempView("customer")
val orderTable = spark.sparkContext.parallelize(orders).toDS()
orderTable.createOrReplaceTempView("order")
Now run the following join query, with a limit and an arbitrary filter on each joined table:
scala> val join = spark.sql("SELECT c.* FROM customer c JOIN order o ON c.id = o.customer WHERE c.id > 1 AND o.amount > 5000 LIMIT 1")
Then print the corresponding optimized execution plan:
scala> println(join.queryExecution.sparkPlan.toString)
CollectLimit 1
+- Project [id#5L, name#6, email#7, zip#8]
+- SortMergeJoin [id#5L], [customer#17L], Inner
:- Filter (id#5L > 1)
: +- SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).id AS id#5L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).name, true) AS name#6, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).email, true) AS email#7, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).zip, true) AS zip#8]
: +- Scan ExternalRDDScan[obj#4]
+- Project [customer#17L]
+- Filter ((amount#19L > 5000) && (customer#17L > 1))
+- SerializeFromObject [assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).id AS id#16L, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).customer AS customer#17L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).date, true) AS date#18, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).amount AS amount#19L]
+- Scan ExternalRDDScan[obj#15]
You can see right away that both tables are sorted in their entirety before the merge (although for these small example tables the Sort step does not appear before the SortMergeJoin), and the limit is only applied after the join.
If either of these tables contained billions of rows, this query would become extremely slow and resource-intensive no matter how small the limit is.
Can Spark be made to optimize a query like this? Or is there a workaround that doesn't mangle my SQL beyond recognition?
Can Spark be made to optimize a query like this?
In short: no, it cannot.
In older terminology, a join is a wide transformation, where each output partition depends on every upstream partition. Therefore, both parent datasets have to be fully scanned just to compute a single partition of the child dataset.
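A plain-Scala sketch (no Spark, using hypothetical helper names) of why the optimizer cannot blindly push the limit below the join: taking rows early on each side can discard the only rows that would have matched, changing the query's result.

```scala
// Sketch: inner join with the filters from the query above, modeled on
// Scala collections. Customers are ids; orders are (customer, amount) pairs.
object LimitPushdownSketch {
  val customers: Seq[Long] = Seq(0L, 1L, 2L, 3L, 4L)
  val orders: Seq[(Long, Long)] = Seq((1L, 23456L), (3L, 7850L), (2L, 12400L))

  // Correct plan: filter, join, then apply the limit last.
  def joinThenLimit: Seq[Long] =
    customers.filter(_ > 1)
      .flatMap(c => orders.collect { case (cust, amt) if cust == c && amt > 5000 => c })
      .take(1)

  // Hypothetical "pushed-down" plan: limit each side first, then join.
  def limitThenJoin: Seq[Long] = {
    val c1 = customers.filter(_ > 1).take(1)    // keeps only customer 2
    val o1 = orders.filter(_._2 > 5000).take(1) // keeps only the order for customer 1
    c1.flatMap(c => o1.collect { case (cust, _) if cust == c => c })
  }

  def main(args: Array[String]): Unit = {
    println(joinThenLimit) // one matching row survives
    println(limitThenJoin) // empty: the limited sides no longer match each other
  }
}
```

This is why a general-purpose rewrite can only limit after the join completes; limiting before it is a different query.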
If your goal is to:

extract a small number of rows

you should consider using a database, not Apache Spark.
Not sure whether this helps your use case, but have you considered applying the filter and limit clauses to the table itself before the join?

val orderTableLimited = orderTable.filter($"customer" > 1).filter($"amount" > 5000).limit(1)
Spark 3.0 added a LimitPushDown optimizer rule, so from that release onward it is able to do this. Note that this does not push the limit down into an underlying SQL source.
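A hypothetical way to check this yourself in a Spark 3.x spark-shell (assuming the customer and order temp views defined above are registered): print the optimized logical plan and look for a LocalLimit node appearing below the Join.

```scala
// Run in a Spark 3.x spark-shell; `spark` is the SparkSession provided by the shell.
val join = spark.sql(
  "SELECT c.* FROM customer c JOIN order o ON c.id = o.customer LIMIT 1")

// Inspect the plan after optimizer rules (including LimitPushDown) have run.
println(join.queryExecution.optimizedPlan.numberedTreeString)
```

Whether the limit actually moves below the join depends on the join type and condition, so treat this as a diagnostic, not a guarantee.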