
Apache Spark: how to get the specified output from a Spark RDD without using combineByKey and aggregateByKey


Here is my data:

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
Now I want the following two kinds of output, but without using combineByKey or aggregateByKey:

1) Array[(String, Int)] = Array((foo,5), (bar,3))  
2) Array((foo,Set(B, A)), (bar,Set(C, D)))
Here is my attempt:

scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
     | "bar=D", "bar=D")  
scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
sample: Array[(String, String)] = Array((foo,A), (foo,A), (foo,A), (foo,A), (foo,B), (bar,C), (bar,D), (bar,D))  
Now, when I type the variable name followed by Tab to see the applicable methods on the mapped RDD, I can see the following options, none of which does what I need:

scala> sample.
apply          asInstanceOf   clone          isInstanceOf   length         toString       update         

So how can I achieve this?

Here is a standard approach.

One caveat: you need to work with an RDD. I think that is your bottleneck.

Here you go:

val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")

// Split each "key=value" string into a (key, value) pair -- still a local Array
val sample = keysWithValuesList.map(_.split("=")).map(p => (p(0), p(1)))

// 1) Count per key: turn the pairs into an RDD of (key, 1) and sum the ones
val sample2 = sc.parallelize(sample.map(x => (x._1, 1)))
val sample3 = sample2.reduceByKey(_ + _)
sample3.collect()   // Array((foo,5), (bar,3))

// 2) Distinct values per key: group the values, then convert each group to a Set
val sample4 = sc.parallelize(sample).groupByKey()
sample4.collect()

val sample5 = sample4.map(x => (x._1, x._2.toSet))
sample5.collect()   // e.g. Array((foo,Set(A, B)), (bar,Set(C, D)))
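
For the second output, groupByKey ships every individual value across the shuffle. A minimal variation that still avoids calling combineByKey and aggregateByKey directly is to build the sets with reduceByKey, so partial sets get merged map-side first (a sketch, assuming the same SparkContext sc and keysWithValuesList as above):

val pairs = sc.parallelize(keysWithValuesList.map(_.split("=")).map(p => (p(0), p(1))))

// Wrap each value in a one-element Set, then merge the sets per key
val setsByKey = pairs.mapValues(Set(_)).reduceByKey(_ ++ _)
setsByKey.collect()   // e.g. Array((foo,Set(A, B)), (bar,Set(C, D)))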

You seem to be using an Array, and not an RDD. To solve this for an RDD in Spark, you need to perform some kind of group-by plus aggregation, for example with aggregateByKey. I'm not sure I follow here.

I don't want to use aggregateByKey or combineByKey.
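
For completeness, a minimal sketch for output (1) alone, assuming the pairs RDD from the sketch above: countByKey returns the per-key counts to the driver as a Map, again without calling combineByKey or aggregateByKey yourself.

// Counts per key, collected to the driver as a Map[String, Long]
pairs.countByKey()   // e.g. Map(foo -> 5, bar -> 3)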