Apache Spark: SparkSQL subqueries and performance


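To let system users (through the application's web UI) dynamically create different data dictionaries from auxiliary data, I use DataFrames and expose them as temporary tables, for example: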

Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
The number of these dictionaries is limited only by the users' imagination and business needs.

After that, users can also create different queries whose conditions rely on the previously defined auxiliary data, for example SQL WHERE conditions:

Q1: country IN (FROM medium_countries)
Q2: (TRUE = ((country IN (FROM medium_countries)) AND (country IN (FROM big_countries))) AND EMAIL IS NOT NULL) AND phone = '+91-9111999998'
Q3: TRUE = ((country IN (FROM medium_countries)) AND (country IN (FROM big_countries))) AND EMAIL IS NOT NULL
......
Qn: name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'
The number of these queries is limited only by the users' imagination and business needs.

What worries me most right now are the subqueries, such as
country IN (FROM medium_countries)

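By design, I can't use an explicit JOIN here, so I rely on subqueries only. So my question is this: these auxiliary tables should usually be relatively small, I'd say a few thousand rows in the worst case, and the total number of such tables a few hundred in the worst case. With that in mind, can this approach lead to performance problems? And are there techniques that could optimize the process, such as caching these dictionaries in memory?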

UPDATED

Right now I can only test this in Spark local mode.

Queries:

country IN (FROM big_countries)
TRUE = ((country IN (FROM medium_countries)) AND (country IN (FROM big_countries))) AND EMAIL IS NOT NULL
Execution plans:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
|plan                                                                                                                                                                                                                                                                                                                                                                            |tag|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
|== Physical Plan ==
*(1) Project [unique_id#27L]
+- *(1) BroadcastHashJoin [country#22], [country#3], LeftSemi, BuildRight
   :- *(1) Project [country#22, unique_id#27L]
   :  +- LocalTableScan [name#19, email#20, phone#21, country#22, unique_id#27L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- LocalTableScan [country#3]|big|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
|plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |tag|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
|== Physical Plan ==
*(1) Project [unique_id#27L]
+- *(1) Filter (true = (exists#53 && exists#54))
   +- *(1) BroadcastHashJoin [country#22], [country#3], ExistenceJoin(exists#54), BuildRight
      :- *(1) BroadcastHashJoin [country#22], [country#8], ExistenceJoin(exists#53), BuildRight
      :  :- *(1) Project [country#22, unique_id#27L]
      :  :  +- *(1) Filter isnotnull(EMAIL#20)
      :  :     +- LocalTableScan [name#19, email#20, phone#21, country#22, unique_id#27L]
      :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      :     +- LocalTableScan [country#8]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
         +- LocalTableScan [country#3]|big|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+
I think that:

CACHE TABLE tbl  as in sql("CACHE TABLE tbl")
is what you need to run after your:

...createOrReplaceTempView....
and, of course, before the larger queries.

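In current Spark versions, the CACHE TABLE statement above is eager by default, not lazy. As the manual states, you no longer need to trigger cache materialization manually. That is, there is no need to run df.show or df.count anymore.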
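A minimal sketch of that sequence, assuming a local SparkSession. The object name DictionaryCache and the helper registerAndCache are hypothetical, introduced only for illustration. The helper registers a dictionary as a temp view, caches it with CACHE TABLE, and asks the catalog whether the view is cached.

```scala
import org.apache.spark.sql.SparkSession

object DictionaryCache {
  // Hypothetical helper: registers a dictionary as a temp view and caches it.
  // `name` is assumed to be a valid, trusted SQL identifier.
  def registerAndCache(spark: SparkSession, name: String, values: Seq[String]): Boolean = {
    import spark.implicits._
    values.toDF("country").createOrReplaceTempView(name)
    // CACHE TABLE is eager by default; CACHE LAZY TABLE would defer materialization.
    spark.sql(s"CACHE TABLE $name")
    // Ask the catalog whether the view is now marked as cached.
    spark.catalog.isCached(name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dictionary-cache")
      .getOrCreate()

    val cached = registerAndCache(
      spark, "big_countries", Seq("Italy", "France", "United States", "Spain"))
    println(s"big_countries cached: $cached")

    // Release the memory once the dictionary is no longer needed.
    spark.sql("UNCACHE TABLE big_countries")
    spark.stop()
  }
}
```

If a user redefines a dictionary, re-running the helper replaces the view and caches the new contents.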

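Once the data is in memory, it does not need to be fetched again each time until you refresh it explicitly. And here there appears to be no filtering, just a one-time load of the whole, limited data set.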

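I don't know your design, but from the look of it the subquery should be fine. Try this approach and look at the physical plan. In a traditional RDBMS, this kind of limited subquery is, as far as I can see, not a deal breaker either.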

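You can also see that the physical plan shows the Catalyst Optimizer has already converted your IN subquery into a join, a typical performance improvement for larger data sets.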
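One way to check this rewrite yourself is sketched below. The users view and its sample rows are made up for illustration, and the subquery is written in the explicit SELECT country FROM ... form rather than the FROM shorthand used in the question. The object name SubqueryPlan and the helper optimizedPlanFor are likewise hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SubqueryPlan {
  // Hypothetical helper: builds tiny sample views, runs an IN-subquery filter,
  // prints the physical plan and returns the optimized logical plan as text.
  def optimizedPlanFor(spark: SparkSession): String = {
    import spark.implicits._

    // Made-up sample data standing in for the real user table.
    Seq(("Donald", "Spain", 1L), ("Anna", "Poland", 2L))
      .toDF("name", "country", "unique_id")
      .createOrReplaceTempView("users")
    Seq("Italy", "France", "United States", "Spain")
      .toDF("country")
      .createOrReplaceTempView("big_countries")

    val df = spark.sql(
      "SELECT unique_id FROM users " +
      "WHERE country IN (SELECT country FROM big_countries)")

    // Prints the physical plan, comparable to the plans shown in the question.
    df.explain()
    df.queryExecution.optimizedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("subquery-plan")
      .getOrCreate()
    println(optimizedPlanFor(spark))
    spark.stop()
  }
}
```

In the printed plan, look for a LeftSemi join in place of the IN predicate, as in the first plan from the question.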


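As a result, the smaller tables are also "broadcast" to the executors on the worker nodes, which improves performance as well. You probably don't need to set any broadcast limit; you could set it explicitly, but based on what I observe that does not seem necessary.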
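If you do want to set the limit explicitly, the relevant setting is spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB; a value of -1 disables broadcast joins. A minimal sketch, assuming a local SparkSession. The object name BroadcastThreshold and the helper setThreshold are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastThreshold {
  val Key = "spark.sql.autoBroadcastJoinThreshold"

  // Hypothetical helper: sets the threshold (in bytes) for the current session
  // and returns the value as the session configuration now reports it.
  def setThreshold(spark: SparkSession, bytes: Long): String = {
    spark.conf.set(Key, bytes)
    spark.conf.get(Key)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-threshold")
      .getOrCreate()

    // Tables estimated to be smaller than this size are candidates for broadcasting.
    println(setThreshold(spark, 10L * 1024 * 1024))
    spark.stop()
  }
}
```

With dictionaries of a few thousand short strings, the default threshold should already be far above their size, so this is only worth touching if the plan stops showing BroadcastHashJoin.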

Can you post the physical execution plan of the DataFrame, i.e. the output of Dataframe.explain?
Sure, I've updated the question.