Apache Spark: preserving the partitioner after modifying the key

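First of all, sorry if this is a dumb question; I'm fairly new to Spark.

I'm trying to perform some group operations in Spark, and I want to avoid an extra shuffle when modifying the key of an RDD.

The original RDDs contain JSON strings.

Simplifying the logic, my code looks like this: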

import org.apache.spark.rdd.RDD

case class Key1(a: String, b: String)

val grouped1: RDD[(Key1, String)] = rdd1.keyBy(generateKey1(_))
val grouped2: RDD[(Key1, String)] = rdd2.keyBy(generateKey2(_))

val joined: RDD[(Key1, (String, String))] = grouped1.join(grouped2)

Now I want to include a new field in the key and perform some reduce operations. So I have something like:

case class Key2(a: String, b: String, c: String)

val withNewKey: RDD[(Key2, (String, String))] = joined.map { case (key, (val1, val2)) =>
  val newKey = Key2(key.a, key.b, extractWhatever(val2))
  (newKey, (val1, val2))
}

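withNewKey.reduceByKey.....

If I'm not mistaken, the partitioner is lost because the key has changed, so the reduce operation may shuffle the data. But that seems unnecessary: the key has only been extended, so records with the same new key are already in the same partition and no shuffle should be needed.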

Am I missing something? How can I avoid this shuffle?

Thanks

You can use mapPartitions with preservesPartitioning set to true:

joined.mapPartitions(
  _.map{ case (key, (val1, val2)) => ... },
  true
)
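To make that concrete, here is a minimal local sketch that compares a plain map with mapPartitions and then runs reduceByKey on the result. It is an illustration under assumptions, not the asker's real job: the sample data, the HashPartitioner(4), the object name PreservePartitionerSketch, the helper extendKey, and the body of extractWhatever are all hypothetical stand-ins.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class Key1(a: String, b: String)
case class Key2(a: String, b: String, c: String)

object PreservePartitionerSketch {
  // Hypothetical stand-in for the question's extractWhatever.
  def extractWhatever(value: String): String = value.take(1)

  // Extends the key while telling Spark to keep the parent's partitioner.
  def extendKey(joined: RDD[(Key1, (String, String))]): RDD[(Key2, (String, String))] =
    joined.mapPartitions(
      _.map { case (key, (val1, val2)) =>
        (Key2(key.a, key.b, extractWhatever(val2)), (val1, val2))
      },
      preservesPartitioning = true
    )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preserve-partitioner-sketch")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample data, partitioned by Key1 the way a join result would be.
      val partitioner = new HashPartitioner(4)
      val joined: RDD[(Key1, (String, String))] = sc
        .parallelize(Seq(
          Key1("a", "b") -> ("left1", "x1"),
          Key1("a", "b") -> ("left2", "x2"),
          Key1("c", "d") -> ("left3", "y1")
        ))
        .partitionBy(partitioner)

      // A plain map may change keys, so Spark drops the partitioner.
      val viaMap = joined.map { case (key, (val1, val2)) =>
        (Key2(key.a, key.b, extractWhatever(val2)), (val1, val2))
      }
      val viaMapPartitions = extendKey(joined)

      println(viaMap.partitioner.isDefined)                      // false
      println(viaMapPartitions.partitioner == Some(partitioner)) // true

      // With the partitioner kept, reduceByKey can reuse it; the printed
      // lineage lets you check whether an extra shuffle stage was added.
      val reduced = viaMapPartitions.reduceByKey { (x, y) => (x._1 + y._1, x._2 + y._2) }
      println(reduced.toDebugString)
    } finally sc.stop()
  }
}
```

One caveat: the preserved partitioner was computed from Key1, not Key2. That is fine for the reduce here, because two records with the same Key2 necessarily had the same Key1 and so already sit in the same partition. Operations that recompute a partition from the new key (for example lookup, or a join against another RDD genuinely partitioned by Key2) could give wrong results, so it is probably safest to treat the result as "grouped correctly" rather than "partitioned by Key2".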
