Apache Spark: pivot takes quadratic space in the number of columns

I am running the following Scala code:

    import org.apache.spark.sql.functions.count
    import spark.implicits._

    val N = 10
    val data = (1 to N).map(i => (1, i, 20.0)).toDF("a", "b", "c")
    data
      // this line renames the columns to themselves;
      // it breaks the query optimizer and the pivot becomes quadratic
      .toDF(data.columns: _*)
      .groupBy("a")
      .pivot("b")
      .agg(count("c"))
      .explain(true)
As a result, I get the following plans:

== Parsed Logical Plan ==
'Pivot ArrayBuffer(a#386), 'b, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [count('c)]
+- Project [_1#382 AS a#386, _2#383 AS b#387, _3#384 AS c#388]
   +- LocalRelation [_1#382, _2#383, _3#384]

== Analyzed Logical Plan ==
a: int, 1: bigint, 2: bigint, 3: bigint, 4: bigint, 5: bigint, 6: bigint, 7: bigint, 8: bigint, 9: bigint, 10: bigint
Project [a#386, __pivot_count(`c`) AS `count(``c``)`#467[0] AS 1#468L, __pivot_count(`c`) AS `count(``c``)`#467[1] AS 2#469L, __pivot_count(`c`) AS `count(``c``)`#467[2] AS 3#470L, __pivot_count(`c`) AS `count(``c``)`#467[3] AS 4#471L, __pivot_count(`c`) AS `count(``c``)`#467[4] AS 5#472L, __pivot_count(`c`) AS `count(``c``)`#467[5] AS 6#473L, __pivot_count(`c`) AS `count(``c``)`#467[6] AS 7#474L, __pivot_count(`c`) AS `count(``c``)`#467[7] AS 8#475L, __pivot_count(`c`) AS `count(``c``)`#467[8] AS 9#476L, __pivot_count(`c`) AS `count(``c``)`#467[9] AS 10#477L]
+- Aggregate [a#386], [a#386, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0) AS __pivot_count(`c`) AS `count(``c``)`#467]
   +- Aggregate [a#386, b#387], [a#386, b#387, count(c#388) AS count(`c`)#445L]
      +- Project [_1#382 AS a#386, _2#383 AS b#387, _3#384 AS c#388]
         +- LocalRelation [_1#382, _2#383, _3#384]

== Optimized Logical Plan ==
Aggregate [a#386], [a#386, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[0] AS 1#468L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[1] AS 2#469L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[2] AS 3#470L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[3] AS 4#471L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[4] AS 5#472L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[5] AS 6#473L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[6] AS 7#474L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[7] AS 8#475L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[8] AS 9#476L, pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[9] AS 10#477L]
+- Aggregate [a#386, b#387], [a#386, b#387, count(1) AS count(`c`)#445L]
   +- LocalRelation [a#386, b#387]

== Physical Plan ==
HashAggregate(keys=[a#386], functions=[pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)], output=[a#386, 1#468L, 2#469L, 3#470L, 4#471L, 5#472L, 6#473L, 7#474L, 8#475L, 9#476L, 10#477L])
+- Exchange hashpartitioning(a#386, 200)
   +- HashAggregate(keys=[a#386], functions=[partial_pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)], output=[a#386, 1#456L, 2#457L, 3#458L, 4#459L, 5#460L, 6#461L, 7#462L, 8#463L, 9#464L, 10#465L])
      +- *(2) HashAggregate(keys=[a#386, b#387], functions=[count(1)], output=[a#386, b#387, count(`c`)#445L])
         +- Exchange hashpartitioning(a#386, b#387, 200)
            +- *(1) HashAggregate(keys=[a#386, b#387], functions=[partial_count(1)], output=[a#386, b#387, count#499L])
               +- LocalTableScan [a#386, b#387]
The Analyzed Logical Plan looks fine and efficient, but the Optimized Logical Plan takes quadratic space, and when N is around 1000, Spark crashes with an OutOfMemoryError.
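
One way to see the blow-up without actually triggering the OOM is a minimal sketch like the following: it renders the optimized logical plan through Spark's queryExecution developer API for growing N and uses the string length as a rough proxy for the plan's in-memory size.

    import org.apache.spark.sql.functions.count
    import spark.implicits._

    for (n <- Seq(10, 50, 100, 200)) {
      val df = (1 to n).map(i => (1, i, 20.0)).toDF("a", "b", "c")
      val plan = df
        .toDF(df.columns: _*)  // the same self-rename that triggers the issue
        .groupBy("a").pivot("b").agg(count("c"))
        .queryExecution.optimizedPlan
      // Expect roughly quadratic growth of the rendered plan with n
      println(s"N=$n: optimized plan length = ${plan.toString.length}")
    }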

In the Optimized Logical Plan you can see entries like pivotfirst(b#387, count(`c`)#445L, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0)[0]. One such entry has size N, and there are N such entries, so the plan as a whole takes N * N = O(N^2) space.
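
To make the N-copies-of-size-N structure concrete, here is a small sketch (it reuses the same toy DataFrame and only inspects the plan's string rendering) that counts how many times pivotfirst appears in the optimized plan:

    import org.apache.spark.sql.functions.count
    import spark.implicits._

    val n = 100
    val df = (1 to n).map(i => (1, i, 20.0)).toDF("a", "b", "c")
    val rendered = df
      .toDF(df.columns: _*)
      .groupBy("a").pivot("b").agg(count("c"))
      .queryExecution.optimizedPlan.toString

    // Expect n copies of pivotfirst, each listing all n pivot values,
    // hence O(n^2) characters in total.
    val copies = "pivotfirst".r.findAllIn(rendered).size
    println(s"pivotfirst copies: $copies, rendered length: ${rendered.length}")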

I thought it might just be the textual representation that is inefficient, but after analyzing the heap in the Eclipse Memory Analyzer I can see that the data structures really do take quadratic O(N^2) space.

Is it possible to optimize this and make it linear?
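
For what it's worth, the comments in the snippet above already point at one workaround (a sketch based only on those comments, not a general fix): dropping the redundant self-rename keeps the optimizer from going quadratic.

    // Same pivot, without the .toDF(data.columns: _*) self-rename that,
    // per the comments above, breaks the optimizer.
    data
      .groupBy("a")
      .pivot("b")
      .agg(count("c"))
      .explain(true)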