Apache Spark: Performance impact of Spark pipelines


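Using SQLTransformers, we can create new columns in a DataFrame and chain these SQLTransformers into a Pipeline. We can also achieve the same thing by calling the selectExpr method on the DataFrame several times.
But do the performance optimizations that apply to selectExpr calls also apply to a Pipeline of SQLTransformers?
For example, consider the following two code snippets: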
#Method 1
df = spark.table("transactions")
df = df.selectExpr("*","sum(amt) over (partition by account) as acc_sum")
df = df.selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
df.show(10)

#Method 2
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

df = spark.table("transactions")
trans1 = SQLTransformer(statement="SELECT *, sum(amt) over (partition by account) as acc_sum from __THIS__")
trans2 = SQLTransformer(statement="SELECT *, sum(amt) over (partition by dt) as dt_sum from __THIS__")
# Pipeline takes its stages through the `stages` keyword argument
pipe = Pipeline(stages=[trans1, trans2])
transPipe = pipe.fit(df)
transPipe.transform(df).show(10)

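Do these two ways of computing the same thing perform the same?
Or are there additional optimizations applied to Method 1 that are not used in Method 2?
There are no additional optimizations. As always, when in doubt, check the execution plan: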
df = spark.createDataFrame([(1, 1, 1)], ("amt", "account", "dt"))

(df
    .selectExpr("*","sum(amt) over (partition by account) as acc_sum")
    .selectExpr("*","sum(amt) over (partition by dt) as dt_sum")
    .explain(True))

generates:
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#165]
+- AnalysisBarrier Project [amt#22L, account#23L, dt#24L, acc_sum#158L]

== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, dt_sum#165L, dt_sum#165L]
   +- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
      +- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
         +- Project [amt#22L, account#23L, dt#24L, acc_sum#158L]
            +- Project [amt#22L, account#23L, dt#24L, acc_sum#158L, acc_sum#158L]
               +- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
                  +- Project [amt#22L, account#23L, dt#24L]
                     +- LogicalRDD [amt#22L, account#23L, dt#24L], false

== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
   +- LogicalRDD [amt#22L, account#23L, dt#24L], false

== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(dt#24L, 200)
      +- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
         +- *Sort [account#23L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(account#23L, 200)
               +- Scan ExistingRDD[amt#22L,account#23L,dt#24L]

while applying the two SQLTransformers from Method 2 to the same df produces:
== Parsed Logical Plan ==
'Project [*, 'sum('amt) windowspecdefinition('dt, unspecifiedframe$()) AS dt_sum#150]
+- 'UnresolvedRelation `SQLTransformer_4318bd7007cefbf17a97_826abb6c003c`

== Analyzed Logical Plan ==
amt: bigint, account: bigint, dt: bigint, acc_sum: bigint, dt_sum: bigint
Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L]
+- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, dt_sum#150L, dt_sum#150L]
   +- Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
      +- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
         +- SubqueryAlias sqltransformer_4318bd7007cefbf17a97_826abb6c003c
            +- Project [amt#22L, account#23L, dt#24L, acc_sum#120L]
               +- Project [amt#22L, account#23L, dt#24L, acc_sum#120L, acc_sum#120L]
                  +- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
                     +- Project [amt#22L, account#23L, dt#24L]
                        +- SubqueryAlias sqltransformer_4688bba599a7f5a09c39_f5e9d251099e
                           +- LogicalRDD [amt#22L, account#23L, dt#24L], false

== Optimized Logical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
   +- LogicalRDD [amt#22L, account#23L, dt#24L], false

== Physical Plan ==
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- *Sort [dt#24L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(dt#24L, 200)
      +- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
         +- *Sort [account#23L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(account#23L, 200)
               +- Scan ExistingRDD[amt#22L,account#23L,dt#24L]

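As you can see, the optimized and physical plans are identical; only the generated expression IDs (such as #158 versus #120) differ.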
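Comparing two long plan dumps by eye is error-prone, mostly because of those expression IDs. Below is a minimal sketch of one way to check the match mechanically: it strips the IDs with a regular expression and compares what is left. The helper name `normalize_plan` and the regex are illustrative assumptions, not part of Spark; the two plan strings are the optimized logical plans copied from the outputs above.

```python
import re

# Expression IDs such as #158L or #120L are assigned per query, so they differ
# between otherwise identical plans. Each one is reduced to a bare '#'.
_EXPR_ID = re.compile(r"#\d+L?")


def normalize_plan(plan_text):
    """Hypothetical helper: drop expression IDs and trailing whitespace."""
    lines = plan_text.strip().splitlines()
    return "\n".join(_EXPR_ID.sub("#", line.rstrip()) for line in lines)


# Optimized logical plan from the selectExpr version.
select_expr_plan = """
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#165L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#158L], [account#23L]
   +- LogicalRDD [amt#22L, account#23L, dt#24L], false
"""

# Optimized logical plan from the SQLTransformer version.
transformer_plan = """
Window [sum(amt#22L) windowspecdefinition(dt#24L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS dt_sum#150L], [dt#24L]
+- Window [sum(amt#22L) windowspecdefinition(account#23L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS acc_sum#120L], [account#23L]
   +- LogicalRDD [amt#22L, account#23L, dt#24L], false
"""

# The raw strings differ, the normalized ones should not.
plans_match = normalize_plan(select_expr_plan) == normalize_plan(transformer_plan)
print(plans_match)
```

The same check can be applied to the physical plan sections. It only compares text, so it says nothing about plans that are equivalent but printed differently.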
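It is a DataFrame under the hood (and an RDD under that), so the optimizer works on the execution plan in essentially the same way.
I'm glad someone wrote the answer instead of me :) I guess I'm getting old.
Thanks! That is also a great way to explain it. I didn't think I could use explain plans for things that go through a Pipeline; it blew my mind.
@subramaniamramamasubramanian Everything that produces a Dataset / DataFrame has a plan, and therefore an explain. In any case, you can refer to the documentation.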