
Apache Spark: Spark SQL - does renaming columns affect partitioning?


I wrote an explicit join API that renames the columns of each Dataset with an l_ or r_ prefix, both to disambiguate them and to work around Spark lineage errors of the form columnName1#77 not found in columnName1#123, columnName2#55.

Part of the code looks like this:

 def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
   val left = dataset.toDF(dataset.columns.map("l_" + _): _*)
   val right = other.toDF(other.columns.map("r_" + _): _*)

   new ExplicitJoinExt(left.join(right, joinExpr, joinType))
 }
Users can then pass a join expression such as $"l_columnName1" === $"r_columnName1" && ..., so they are 100% explicit about which columns they are joining on.
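The prefixing itself is plain string manipulation on the column names before `toDF` re-labels them; a minimal pure-Scala sketch of that step (the column names here are made up, and no Spark is involved):

```scala
// Sketch of the l_/r_ prefixing that explicitJoin applies via toDF,
// reduced to plain string operations on the column names (no Spark).
def prefixAll(columns: Seq[String], prefix: String): Seq[String] =
  columns.map(prefix + _)

val leftCols  = prefixAll(Seq("columnName1", "columnName2"), "l_")
val rightCols = prefixAll(Seq("columnName1", "columnName2"), "r_")

// After prefixing, every name is unique across the two sides, so an
// expression like $"l_columnName1" === $"r_columnName1" is unambiguous.
assert((leftCols ++ rightCols).distinct.size == leftCols.size + rightCols.size)
```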

I have now hit a new problem where a partition is too large to load into memory (org.apache.spark.shuffle.FetchFailedException: Too large frame ...), even though the input (partitioned) Datasets read without any problem.

Does renaming columns affect the underlying partitioning of the input Datasets/DataFrames?

Edit

Example 1 - a regular join:

    case class A(a: Int, b: String)

    val l = (0 to 1000000).map(i => A(i, i.toString))
    val r = (0 to 1000000).map(i => A(i, i.toString))

    val ds1 = l.toDF.as[A].repartition(100, $"a")
    val ds2 = r.toDF.as[A].repartition(100, $"a")

    val joined = ds1.join(ds2, Seq("a"), "inner")

    joined.explain

    == Physical Plan ==
    *Project [a#2, b#3, b#15]
    +- *SortMergeJoin [a#2], [a#14], Inner
       :- *Sort [a#2 ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- *Sort [a#14 ASC NULLS FIRST], false, 0
          +- ReusedExchange [a#14, b#15], Exchange hashpartitioning(a#2, 100)
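The `ReusedExchange` appears because both sides were hash-partitioned by the same expression into the same number of partitions, so the rows for a given key land in the same partition on each side. A rough, simplified model of that co-location property (Spark's `HashPartitioner` uses a non-negative mod of the key's `hashCode`; this is a stand-in, not Spark's actual code path):

```scala
// Rough model of hash partitioning: key -> partition id in [0, numPartitions).
def partitionId(key: Int, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// ds1 and ds2 were both repartitioned by the same key into 100 partitions,
// so every key maps to the same partition on both sides - the rows to be
// joined are co-located, and one Exchange can serve both join inputs.
val leftSide  = (0 to 1000).map(k => k -> partitionId(k, 100)).toMap
val rightSide = (0 to 1000).map(k => k -> partitionId(k, 100)).toMap
assert(leftSide == rightSide)

// With different partition counts the assignments diverge, which is why a
// mismatch in partitioning forces another shuffle.
assert(partitionId(150, 100) != partitionId(150, 4))
```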
Example 2 - using my (possibly misguided) ExplicitJoinExt, which involves the renaming:

    val joined = ds1
      .explicitJoin(ds2, $"l_a" === $"r_a", "inner") // Pimped on conversion to ExplicitJoin type, columns prefixed by l_ or r_. DS joined by expr and join type
      .selectLeft                                    // Select just left prefixed columns
      .toDF                                          // Convert back from ExplicitJoinExpr to DF
      .as[A]

    joined.explain


    == Physical Plan ==
    *Project [l_a#24 AS a#53, l_b#25 AS b#54]
    +- *BroadcastHashJoin [l_a#24], [r_a#29], Inner, BuildRight
       :- *Project [a#2 AS l_a#24, b#3 AS l_b#25]
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
          +- *Project [a#14 AS r_a#29]
             +- Exchange hashpartitioning(a#14, 100)
                +- LocalTableScan [a#14]

So for the second join, we end up repartitioning all over again - correct?

No. I checked on Spark 2.3.1: renaming does not affect partitioning, at least not with this approach:

 val ds11 = ds1.repartition(4) 
 val ds11 = ds1.repartition(2, $"cityid")
The explain output of:

val j = left.join(right, $"l_personid" === $"r_personid", "inner").explain
shows 2 and 4 as the numbers of partitions in my case:

== Physical Plan ==
*(2) BroadcastHashJoin [l_personid#641], [r_personid#647], Inner, BuildRight, false
:- *(2) Project [personid#612 AS l_personid#641, personname#613 AS l_personname#642, cityid#614 AS l_cityid#643]
:  +- Exchange hashpartitioning(cityid#614, 2)
:     +- LocalTableScan [personid#612, personname#613, cityid#614]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *(1) Project [personid#612 AS r_personid#647, personname#613 AS r_personname#648, cityid#614 AS r_cityid#649]
      +- Exchange hashpartitioning(personid#612, 4)
         +- LocalTableScan [personid#612, personname#613, cityid#614]
You can see that the renamed columns are mapped back to their original names.

In tests for a post elsewhere, we were able to determine that new operations that depend on an aggregation or a join default to 200 partitions, unless

 sqlContext.setConf("spark.sql.shuffle.partitions", "some val")

is issued in the code to set the desired value. Results can differ when it is a small amount of data being coalesced, etc.
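The 200 is simply the default of that setting; a toy model of how an explicit value overrides the default (a plain Map standing in for Spark's SQLConf, not the real implementation):

```scala
// Toy stand-in for SQLConf: shuffle-introducing operators pick up
// spark.sql.shuffle.partitions, falling back to 200 unless it was set.
val defaults = Map("spark.sql.shuffle.partitions" -> "200")

def shufflePartitions(sessionConf: Map[String, String]): Int =
  sessionConf.getOrElse("spark.sql.shuffle.partitions",
    defaults("spark.sql.shuffle.partitions")).toInt

// Unset: aggregations and joins shuffle into 200 partitions.
assert(shufflePartitions(Map.empty) == 200)
// Explicitly set: the configured value wins.
assert(shufflePartitions(Map("spark.sql.shuffle.partitions" -> "8")) == 8)
```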

For anyone still running into this: renaming columns does affect partitioning in Spark < 3.0.

Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()
gives the following plan:

== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
   +- Exchange hashpartitioning(b#37, 10)
      +- LocalTableScan [a#36, b#37]

This was fixed in Spark 3.0.
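A toy model of what went wrong before the fix (the `Alias` class and lookup function here are hypothetical illustrations, not Spark's internals): the output partitioning is recorded against the original attribute, and a rename introduces an alias; if the partitioning check does not look through aliases, a repartition on the new name appears unsatisfied and a second Exchange is planned.

```scala
// Hypothetical model: a rename records an alias from the old name to the new.
case class Alias(from: String, to: String)

// Is a requested partitioning already satisfied by the existing one?
// aliasAware = false models the pre-3.0 behaviour, true the fixed one.
def partitioningSatisfied(partitionedOn: String,
                          requested: String,
                          aliases: Seq[Alias],
                          aliasAware: Boolean): Boolean =
  if (requested == partitionedOn) true
  else if (aliasAware)
    aliases.exists(a => a.from == partitionedOn && a.to == requested)
  else false

val aliases = Seq(Alias("b", "c"))
// Pre-3.0: the rename hides the partitioning, so Spark shuffles again.
assert(!partitioningSatisfied("b", "c", aliases, aliasAware = false))
// 3.0+: the rename is seen through and no extra Exchange is needed.
assert(partitioningSatisfied("b", "c", aliases, aliasAware = true))
```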

My concern is that before the join I partition both Datasets by (say) joinKey1, but then rename to l_joinKey1 and r_joinKey1 before joining on $"l_joinKey1" === $"r_joinKey1" - is Spark smart enough to realise the data is already partitioned and co-located on the original column names?

I checked the scenario (over lunch) and found that the partition counts are inherited, rather than falling back to the default of 200. I infer there is no problem.

Have you tried explain, before and after changing the names?

The question - does renaming columns affect the underlying partitioning of the input Datasets/DataFrames? - has, I believe, been answered. As shown in the explains, the renamed columns map back to the original ones, and the partitioning I originally applied still holds.