
How to do a join through map on an RDD in Scala


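I have a list of (id, (name value)) pairs:

val input = sc.parallelize(Array(Array(1, "a 10"),
  Array(1, "b 11"),
  Array(3, "a 12"),
  Array(3, "b 13"),
  Array(3, "c 14"),
  Array(4, "b 15")))

In the map stage, the key is the id and the value is the "name value" string:

val rdd = input.map(x => (x(0), x(1)))

My expected result: for each id, compare the values pairwise by name using a function f().

For example, when id == 3, the result after the reduce stage would be: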

(key: ab, value: f(12,13))
(key: ac, value: f(12,14))
(key: bc, value: f(13,14))
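
To make the target concrete, here is a minimal sketch on a plain Scala collection (no Spark) covering only the id 3 rows. The question never defines f(), so the sum used below is purely a placeholder assumption, and the names `rows` and `pairs` are illustrative.

```scala
// Rows for id 3, already split into (name, value); values parsed as Int for illustration.
val rows = Seq(("a", 12), ("b", 13), ("c", 14))

// Placeholder for the unspecified f(); a sum is assumed only for demonstration.
def f(x: Int, y: Int): Int = x + y

// combinations(2) yields each unordered pair of rows exactly once, in input order.
val pairs = rows.combinations(2).map {
  case Seq((n1, v1), (n2, v2)) => (n1 + n2, f(v1, v2))
}.toList
```

With the placeholder sum, `pairs` holds one entry each for the keys ab, ac, and bc, which is the shape of the expected result listed above.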

The RDD can be joined with itself to get all pairs within each id, and a filter then keeps only the required rows:

// split the string value into two parts: (name, value)
val rdd = input.map(x => (x(0), x(1).toString.split(" ")))
  .map { case (key, parts) => (key, (parts(0), parts(1))) }

// self-join, keep each name pair once (ascending names), and format as expected
val both = rdd
  .join(rdd)
  .filter { case (_, (v1, v2)) => v1._1 < v2._1 }
  .map { case (key, (v1, v2)) => (s"[$key] key: " + v1._1 + v2._1, s"value: f(${v1._2},${v2._2})") }

PS: more advanced filtering could be used here.

Output:

([1] key: ab,value: f(10,11))
([3] key: ab,value: f(12,13))
([3] key: ac,value: f(12,14))
([3] key: bc,value: f(13,14))
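
To see why the filter is needed, the same steps can be mimicked on a plain Scala collection without a cluster. This is only a local sketch of the join semantics, not Spark code; the names `localInput`, `parsed`, `joined`, and `kept` are made up for the illustration.

```scala
// The sample data as (id, "name value") pairs; a local stand-in for the RDD.
val localInput = Seq((1, "a 10"), (1, "b 11"), (3, "a 12"), (3, "b 13"), (3, "c 14"), (4, "b 15"))

// Split "name value" into (name, value), as in the first map step.
val parsed = localInput.map { case (id, s) =>
  val parts = s.split(" ")
  (id, (parts(0), parts(1)))
}

// Local equivalent of rdd.join(rdd): every pair of rows sharing an id,
// including each row paired with itself and both orderings of distinct rows.
val joined = for {
  (id1, v1) <- parsed
  (id2, v2) <- parsed
  if id1 == id2
} yield (id1, (v1, v2))

// Same predicate as the filter step: keep only pairs whose names are in ascending order.
val kept = joined.filter { case (_, (v1, v2)) => v1._1 < v2._1 }
```

On this data the self-join yields 14 rows (4 for id 1, 9 for id 3, 1 for id 4), and the filter keeps 4 of them. Because the comparison is a strict `<` on the name strings, self-pairs and mirrored pairs are dropped, and so would any two rows that share both id and name.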