Apache Spark: displaying cost-based optimizer statistics

Tags: apache-spark, apache-spark-sql, databricks

I have tried to enable the Spark CBO by setting the property in the Spark shell: spark.conf.set("spark.sql.cbo.enabled", true)

I am now running spark.sql("ANALYZE TABLE events COMPUTE STATISTICS").show()

Running this query does not show any statistics: spark.sql("select * from events where eventID = 1").explain(true)
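Taken together, a minimal spark-shell sketch of the steps above (the events table name comes from the question; assuming it already exists as a saved table):

// enable the cost-based optimizer for this session
spark.conf.set("spark.sql.cbo.enabled", true)

// collect table-level statistics (row count, size in bytes)
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

// expectation: the plan should now be annotated with statistics
spark.sql("select * from events where eventID = 1").explain(true)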

I am running this on Spark 2.2.1:

scala> spark.sql("select * from events where eventID=1").explain()
== Physical Plan ==
*Project [buyDetails.capacity#923, buyDetails.clearingNumber#924, buyDetails.leavesQty#925L, buyDetails.liquidityCode#926, buyDetails.orderID#927, buyDetails.side#928, cancelQty#929L, capacity#930, clearingNumber#931, contraClearingNumber#932, desiredLeavesQty#933L, displayPrice#934, displayQty#935L, eventID#936, eventTimestamp#937L, exchange#938, executionCodes#939, fillID#940, handlingInstructions#941, initiator#942, leavesQty#943L, nbbPrice#944, nbbQty#945L, nboPrice#946, ... 29 more fields]
+- *Filter (isnotnull(eventID#936) && (cast(eventID#936 as int) = 1))
+- *FileScan parquet default.events[buyDetails.capacity#923,buyDetails.clearingNumber#924,buyDetails.leavesQty#925L,buyDetails.liquidityCode#926,buyDetails.orderID#927,buyDetails.side#928,cancelQty#929L,capacity#930,clearingNumber#931,contraClearingNumber#932,desiredLeavesQty#933L,displayPrice#934,displayQty#935L,eventID#936,eventTimestamp#937L,exchange#938,executionCodes#939,fillID#940,handlingInstructions#941,initiator#942,leavesQty#943L,nbbPrice#944,nbbQty#945L,nboPrice#946,... 29 more fields] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/asehgal/data/events], PartitionFilters: [], PushedFilters: [IsNotNull(eventID)], ReadSchema: struct<buyDetails.capacity:string,buyDetails.clearingNumber:string,buyDetails.leavesQty:bigint,bu...

For me, the statistics were not visible in df.explain(true) either. I played around a bit and could print the statistics using println(df.queryExecution.stringWithStats). Full example:

import org.apache.spark.sql.SparkSession

val ss = SparkSession
  .builder()
  .master("local[*]")
  .appName("TestCBO")
  .config("spark.sql.cbo.enabled", true)
  .getOrCreate()

import ss.implicits._

// two tables of different sizes, saved so that ANALYZE TABLE can be run on them
val df1 = ss.range(10000L).toDF("i")
df1.write.mode("overwrite").saveAsTable("table1")

val df2 = ss.range(100000L).toDF("i")
df2.write.mode("overwrite").saveAsTable("table2")

// compute table- and column-level statistics for the CBO
ss.sql("ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS i")
ss.sql("ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS i")

val df = ss.table("table1").join(ss.table("table2"), "i")
  .where($"i" > 1000)

// print the query plan annotated with statistics
println(df.queryExecution.stringWithStats)
which gives the plans annotated with statistics (sizeInBytes, rowCount) on each node.

This is not shown in the standard df.explain, because that call fires (Dataset.scala):

ExplainCommand(queryExecution.logical, extended = true) //  cost = false in this constructor
To be able to output the cost, we can invoke this ExplainCommand ourselves:

import org.apache.spark.sql.execution.command.ExplainCommand
val explain = ExplainCommand(df.queryExecution.logical, extended = true, cost = true)
ss.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
  r => println(r.getString(0))
}

Here you could also enable output of the generated code by setting codegen = true.
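A minimal sketch of that variant, assuming the Spark 2.2 ExplainCommand case class where codegen is another boolean flag next to extended and cost:

import org.apache.spark.sql.execution.command.ExplainCommand

// same pattern as above, but requesting the generated (whole-stage codegen) code
val explainCodegen = ExplainCommand(df.queryExecution.logical, codegen = true)
ss.sessionState.executePlan(explainCodegen).executedPlan.executeCollect().foreach {
  r => println(r.getString(0))
}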

Alternatively, this gives a similar output:

df // join of two dataframes and filter
 .registerTempTable("tmp")
ss.sql("EXPLAIN COST select * from tmp").show(false)
To see the statistics in the Spark UI, you have to go to the SQL tab and then select the corresponding query (here: df.show):


I am not sure where you are seeing statistics in the physical plan; where did you get that information? I went through the Apache Spark documentation and also tried Spark 2.3, and there were no statistics in the query plan or in the Spark UI either.

Looks good! But are these numbers actually correct for you? It shows me statistics, but even after computing column statistics the results are fairly random.

Hey @RaphaelRoth, could you share the properties you set in your Spark conf? Did you analyze all the columns of the table before running EXPLAIN COST?

@RajatMishra I added a complete example.
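Regarding the column-statistics question in the comments: in Spark 2.x, per-column statistics must be requested explicitly; a sketch for the question's events table (only eventID is known from the question, so the column list is an assumption):

// table-level statistics only (row count, total size)
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

// per-column statistics (min/max, distinct count, null count) used by the CBO for estimates
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS eventID")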