Apache Spark: joining Spark DataFrames with a custom partitioner works in Python but not in Scala?

apache-spark join apache-spark-sql rdd partitioner

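I recently read an article describing how to custom-partition a DataFrame, in which the author demonstrates the technique in Python. I use Scala, and the technique looked like a good way to deal with skew, so I tried something similar. I found that when one does the following: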

- create 2 data frames, D1, D2
- convert D1, D2 to 2 Pair RDDs R1,R2 
    (where the key is the key you want to join on)
- repartition R1,R2 with a custom partitioner 'C'
    where 'C' has 2 partitions (p-0,p-1) and
    stuffs everything in p-1, except keys == 'a'
- join R1,R2 as R3
- OBSERVE that:
    - partitioner for R3 is 'C' (same for R1,R2)
    - when printing the contents of each partition of R3, all entries
      except the one keyed by 'a' are in p-1
- set D1' <- R1.toDF 
- set D2' <- R2.toDF
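
The RDD-level steps above, up to the join and the two observations, can be sketched in Scala roughly as follows. This is an illustrative sketch, not the original program: the class name `KeyAPartitioner`, the object and method names, the column names, and the sample rows are all assumptions made for the example, and it assumes a local Spark 2.x or later session.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for partitioner 'C': two partitions,
// key "a" goes to p-0 and every other key goes to p-1.
class KeyAPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key == "a") 0 else 1
  // Equality lets Spark treat two instances as the same partitioning.
  override def equals(other: Any): Boolean = other.isInstanceOf[KeyAPartitioner]
  override def hashCode: Int = numPartitions
}

object CustomPartitionJoinSketch {
  // Builds D1 and D2, converts them to pair RDDs keyed by deptId,
  // repartitions both with the same partitioner, and joins them as R3.
  def joinWithCustomPartitioner(spark: SparkSession): RDD[(String, (String, String))] = {
    import spark.implicits._
    // Sample rows are made up for illustration.
    val d1 = Seq(("a", "ant dept"), ("b", "badger dept"), ("c", "cat dept")).toDF("deptId", "deptName")
    val d2 = Seq(("a", "anne"), ("b", "bob"), ("c", "claire")).toDF("deptId", "empName")

    val c = new KeyAPartitioner
    val r1 = d1.rdd.map(row => (row.getString(0), row.getString(1))).partitionBy(c)
    val r2 = d2.rdd.map(row => (row.getString(0), row.getString(1))).partitionBy(c)
    r1.join(r2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("custom-partition-join").getOrCreate()
    val r3 = joinWithCustomPartitioner(spark)
    println(s"R3 partitioner: ${r3.partitioner}")
    // Print the contents of each partition of R3.
    r3.glom().collect().zipWithIndex.foreach { case (rows, idx) =>
      println(s"p-$idx: ${rows.toList}")
    }
    spark.stop()
  }
}
```

Because both inputs already share the same partitioner, the RDD join can reuse it for R3, which is why the key 'a' stays alone in p-0.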

So I came to the following conclusion, which does work for me. But I am really annoyed that I cannot reproduce the behavior shown in the Python article:

When one needs to use custom partitioning with DataFrames in Scala, one must
drop into RDDs, do the join (or whatever operation) on the RDD, then convert back
to a DataFrame. You can't apply the custom partitioner, then convert back to a
DataFrame, do your operations, and expect the custom partitioning to work.

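Now, I hope I'm wrong! Maybe someone with more expertise in Spark internals can guide me here. I have written a small program (below) to illustrate the results. If you can set me straight, thanks in advance.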

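Update

In addition to the Spark code that illustrates the problem, I also tried a simplified version of the code from the original article, which was written in Python. The transformations below create a DataFrame, extract its underlying RDD and repartition it, then recover the DataFrame and verify that the partitioner has been lost.

Python snippet illustrating the problem

Scala snippet illustrating the problem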
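
The round trip described in the update (DataFrame to RDD, repartition, back to DataFrame) can be sketched in Scala as follows. This is not the original snippet: the partitioner class, the object and method names, and the sample data are hypothetical. The check relies only on `RDD.partitioner`, which returns an `Option[Partitioner]`.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical two-way partitioner: key "a" in partition 0, the rest in partition 1.
class AOnlyPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key == "a") 0 else 1
}

object PartitionerRoundTripSketch {
  // Returns (partitioner on the repartitioned RDD, partitioner seen after toDF).
  def roundTrip(spark: SparkSession): (Option[Partitioner], Option[Partitioner]) = {
    import spark.implicits._
    // Sample rows are made up for illustration.
    val df = Seq(("a", "ant dept"), ("b", "badger dept"), ("c", "cat dept")).toDF("deptId", "deptName")

    // Extract the underlying RDD, key it by deptId, and repartition it.
    val repartitioned = df.rdd
      .map(row => (row.getString(0), row.getString(1)))
      .partitionBy(new AOnlyPartitioner)

    // Recover a DataFrame from the repartitioned RDD.
    val restored: DataFrame = repartitioned.toDF("deptId", "deptName")

    (repartitioned.partitioner, restored.rdd.partitioner)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("partitioner-round-trip").getOrCreate()
    val (before, after) = roundTrip(spark)
    println(s"Partitioner on the repartitioned RDD: $before")
    println(s"Partitioner after converting back:    $after")
    spark.stop()
  }
}
```

If the second value comes back empty, the recovered DataFrame no longer advertises the custom partitioning, so a later DataFrame-level join is free to plan its own shuffle.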

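Which version of Spark are you using? If it is 2.x or later, I would recommend using the DataFrame/Dataset API rather than RDDs.

That API is much easier to work with than RDDs, and it performs better on later versions of Spark.

You may find the link below useful for how to join DataFrames:

Once you have the joined DataFrame, you can use the link below to partition it by column value, which I assume is what you are trying to achieve: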
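
As a rough illustration of the DataFrame-only route suggested here, the sketch below joins two DataFrames and then repartitions the result by a column with `Dataset.repartition(numPartitions, partitionExprs*)`. The object and method names and the sample data are assumptions. Note that this uses Spark's built-in hash partitioning on the column expression, so it does not give the per-key control a custom partitioner does.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataFrameJoinSketch {
  // Joins department and employee on deptId, then repartitions by that column.
  def joinAndRepartition(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Sample rows are made up for illustration.
    val department = Seq(("a", "ant dept"), ("b", "badger dept"), ("c", "cat dept")).toDF("deptId", "deptName")
    val employee = Seq(("a", "anne"), ("b", "bob"), ("c", "claire")).toDF("deptId", "empName")

    // Joining on a column-name sequence keeps a single deptId column in the result.
    val joined = department.join(employee, Seq("deptId"))

    // Hash-partition the joined rows into 2 partitions by deptId.
    joined.repartition(2, col("deptId"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("dataframe-join").getOrCreate()
    val result = joinAndRepartition(spark)
    result.show()
    println(s"Number of partitions: ${result.rdd.getNumPartitions}")
    spark.stop()
  }
}
```

Which keys end up together is decided by the hash of `deptId`, not by user code, which is the limitation the question is trying to work around.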

Have a look at an approach that adds a partitionBy method at the Dataset/DataFrame API level. Taking your Emp and Dept objects as an example:

class DeptByIdPartitioner extends TypedPartitioner[Dept] {
  override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
  override def numPartitions: Int = 2
  override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}

class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
  override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
  override def numPartitions: Int = 2
  override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}

Note that we are extending TypedPartitioner, which is compile-time safe: you will not be able to repartition a dataset of persons with an emp partitioner.

val spark = SparkBuilder.getSpark()

import org.apache.spark.sql.exchange.implicits._  // <-- additional import
import spark.implicits._

val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned  = employee.repartitionBy(new EmpByDepIdPartitioner)

If we join datasets that were repartitioned by the same key, Catalyst will correctly recognize this:

val joined = deptPartitioned.join(empPartitioned, "deptId")

println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
  println(s"Partition N ${elem._1}")
  println(s"\t: ${elem._2.toList}")
}

Partition N 0
    : List([a,ant dept,anne])
Partition N 1
    : List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])

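Version 2.4.3. Yes, for most cases I'm 100% on board with DataFrames. In a few cases (such as when you need tighter control over partitioning than the DataFrame's column-expression-based repartitioning API offers), dropping down to RDDs is worth it. In fact, that is what the (Python) article I linked to is based on. Thanks for your answer.

After some tweaking of the internals, I managed to add a custom partitioner to the Dataset API. It is still a long way from production use, because there are many cases I have not covered. The existing partitioner implementations are hard-coded in many places, so a lot of code has to be re-implemented. One example is ShuffleExchangeExec: the tricky part there is the ExchangeCoordinator, which only accepts ShuffleExchangeExec, and so on. I'm not sure whether you are interested in such code; I think this is a current limitation of Spark that may be addressed in a future release. @ChrisBedford

@Gelerion - I'd love to take a look, even if it's a prototype, just to see what you did. Is it on GitHub?

Here it is: just a prototype. Let me know if anything is missing; I stripped it out of a larger project. I'm currently working on a bin-packing partitioner, so I plan to spend more time on this.

@Gelerion - very cool. While you were learning how to do this with Catalyst, did you find any especially good resources that helped you understand the mechanics? I'll definitely take a look at what you did!

I found articles describing the basic concepts of Catalyst, but mostly it was debugging and reading Jira.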
Dept dataset (deptPartitioned), by partition:
Partition N 0
    : List([a,ant dept])
Partition N 1
    : List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])
