How to get a count per year using Spark Scala

I have movie data like the sample below, and I need to count the number of movies per year, for example 2002,2 and 2004,1.

Littlefield, John (I)   x House 2002
Houdyshell, Jayne   demon State 2004
Houdyshell, Jayne   mall in Manhattan   2002

val data = sc.textFile("..line to file")
val dataSplit = data.map(line => { val d = line.split("\t"); (d(0), d(1), d(2)) })

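What I can't figure out is this: when I run dataSplit.take(2).foreach(println), I see that d(0) is the first two columns (for "Littlefield, John (I)", the first name and last name), d(1) is the movie title, such as "x House", and d(2) is the year. How do I get the number of movies for each year?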

Map each line to a (year, 1) tuple and sum the counts with reduceByKey:

val dataSplit = data
  .map(line => { val d = line.split("\t"); (d(2), 1) }) // e.g. ("2002", 1); the year is still a String
  .reduceByKey((a, b) => a + b)                          // sum the 1s for each year

// .collect() gives the result: Array((2004,1), (2002,2))
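
To see the same map-then-sum logic without a Spark cluster, here is a minimal sketch using plain Scala collections in place of an RDD. The object name YearCounts and the helper countByYear are hypothetical names introduced for illustration. It assumes the fields are tab-separated, as in the question, and it skips any line with fewer than three fields, which is an added safeguard that the answer above does not include.

```scala
object YearCounts {
  // Sample rows copied from the question; fields are assumed to be tab-separated.
  val lines: Seq[String] = Seq(
    "Littlefield, John (I)\tx House\t2002",
    "Houdyshell, Jayne\tdemon State\t2004",
    "Houdyshell, Jayne\tmall in Manhattan\t2002"
  )

  // Same shape as the RDD version: emit (year, 1), then sum the ones per year.
  // groupBy plus a per-group sum stands in for reduceByKey here.
  def countByYear(rows: Seq[String]): Map[String, Int] =
    rows
      .map(_.split("\t"))
      // Assumption: lines with fewer than three fields are malformed and dropped.
      .collect { case fields if fields.length >= 3 => (fields(2).trim, 1) }
      .groupBy(_._1)
      .map { case (year, pairs) => (year, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    countByYear(lines).toSeq.sortBy(_._1).foreach {
      case (year, count) => println(s"$year,$count")
    }
}
```

On an actual RDD, the order of the elements returned by collect() is not guaranteed, so sort the result if a stable order matters.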