Scala: Spark join operation based on two columns

I am trying to join two datasets based on two columns. It works as long as I use one column, but fails with the following error:

:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]

val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2,((parts1,parts2,parts3,parts4,amount),(sk,prop1,prop2,prop3,prop4))) => (sk,amount) }

Code:
import org.apache.spark.rdd.RDD

def zipWithIndex[T](rdd: RDD[T]) = {
  val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect

  val ranges = partitionSizes.foldLeft(List((0, 0))) { case (accList, count) =>
    val start = accList.head._2
    val end = start + count
    (start, end) :: accList
  }.reverse.tail.toArray

  rdd.mapPartitionsWithIndex((index, partition) => {
    val start = ranges(index)._1
    val end = ranges(index)._2
    val indexes = Iterator.range(start, end)
    partition.zip(indexes)
  })
}
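As an aside: Spark's RDD API ships a built-in RDD.zipWithIndex that does the same job as the hand-rolled helper above (it assigns a Long index ordered by partition and by position within each partition), so on any reasonably recent Spark version the helper can be replaced with a one-liner. A minimal sketch, assuming some rdd: RDD[T] is in scope:

```scala
// Built-in alternative to the helper: returns RDD[(T, Long)].
// Like the helper, it needs a pass over the data to compute
// per-partition sizes before assigning indexes.
val indexed: RDD[(T, Long)] = rdd.zipWithIndex()
```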
val dimension = sc.
  textFile("dimension.txt").
  map { line =>
    val parts = line.split("\t")
    (parts(0), parts(1), parts(2), parts(3), parts(4), parts(5))
  }

val dimensionWithSK =
  zipWithIndex(dimension).map { case ((nk1, nk2, prop3, prop4, prop5, prop6), idx) =>
    (nk1, nk2, (prop3, prop4, prop5, prop6, idx + nextSurrogateKey))
  }
val fact = sc.
  textFile("fact.txt").
  map { line =>
    val parts = line.split("\t")
    // we need to output (NaturalKey, (FactId, Amount)) in
    // order to be able to join with the dimension data.
    (parts(0), parts(1), (parts(2), parts(3), parts(4), parts(5), parts(6).toDouble))
  }
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Requesting someone's help here.

Thanks,
Sridhar

If you look at the signature of join, it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you are trying to join on the first two elements of the tuple, so you need to map each triple into a pair, where the first element of the pair is a pair holding the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd

left.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.join(
  right.map {
    case (key1, key2, value) => ((key1, key2), value)
  })
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
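Applied to the fact and dimensionWithSK RDDs from the question, the same pattern might look like the sketch below (field names follow the question's code). Note one extra detail: in dimensionWithSK the surrogate key is the last element of the value tuple, not the first, so the pattern in the original finalFact line also had the fields in the wrong order.

```scala
// Re-key both RDDs on the composite natural key (nk1, nk2),
// then join on that pair key and project (surrogateKey, amount).
val factByKey = fact.map {
  case (nk1, nk2, measures) => ((nk1, nk2), measures)
}
val dimByKey = dimensionWithSK.map {
  case (nk1, nk2, props) => ((nk1, nk2), props)
}

val finalFact = factByKey.join(dimByKey).map {
  case ((nk1, nk2), ((parts1, parts2, parts3, parts4, amount),
                     (prop3, prop4, prop5, prop6, sk))) => (sk, amount)
}
```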
rdd1 schema:
field1, field2, field3, fieldX
rdd2 schema:
field1, field2, field3, fieldY
val joinResult = rdd1.join(rdd2,
  Seq("field1", "field2", "field3"), "outer")
joinResult schema:
field1, field2, field3, fieldX, fieldY
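Note that this second answer uses the DataFrame API, whose join(right, usingColumns, joinType) overload resolves a multi-column key by name and keeps a single copy of each key column in the result. A minimal sketch, assuming Spark SQL is available (the column names and sample rows are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-col-join").getOrCreate()
import spark.implicits._

// Hypothetical small DataFrames standing in for rdd1 and rdd2.
val df1 = Seq(("a", "b", "c", 1.0)).toDF("field1", "field2", "field3", "fieldX")
val df2 = Seq(("a", "b", "c", 2.0)).toDF("field1", "field2", "field3", "fieldY")

// Joining on a Seq of column names deduplicates the key columns,
// so the result schema is: field1, field2, field3, fieldX, fieldY.
val joinResult = df1.join(df2, Seq("field1", "field2", "field3"), "outer")
```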