MapReduce example in Scala

scala apache-spark mapreduce

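I have this problem as a homework assignment in Scala. The idea I had, but could not implement successfully, is this:

Iterate over every word; if the word is basketball, take the next word and add it to a map. Reduce by key and sort from highest to lowest.

Unfortunately, I don't know how to get the next word in the list of words. For example, I would like to do something like this: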

val lines = spark.textFile("basketball_words_only.txt") // process lines in file

// split into individual words
val words = lines.flatMap(line => line.split(" "))

var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word

val it = Iterator(words)  

while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do    
}

val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer

val sort = count.sortBy(_._2,false,1)

val result2 = sort.collect()

for (i <- 0 to result2.length - 1) {
 printf("%s follows %d times\n", result2(i)._1, result2(i)._2);
}

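This is from:
As you can see, it counts occurrences of individual words, because the key-value pairs have the form (word, 1). Which part would you need to change to count combinations of words?
This might help you: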
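You can get the maximum count for the first word across all distinct word pairs in a few steps:

1. Strip punctuation and split the content into lowercased words
2. Use sliding(2) to create an array of word pairs
3. Use reduceByKey to count the occurrences of distinct word pairs
4. Use reduceByKey again to capture the word pair with the maximum count for each first word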

Sample code is as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._  // provides sliding() on RDDs

// Split on whitespace and punctuation, lowercase, pair adjacent words, count each pair
val wordPairCountRDD = sc.textFile("/path/to/textfile").
  flatMap( _.split("""[\s,.;:!?]+""") ).
  map( _.toLowerCase ).
  sliding(2).
  map{ case Array(w1, w2) => ((w1, w2), 1) }.
  reduceByKey( _ + _ )

// For each first word, keep the following word with the highest count
val wordPairMaxRDD = wordPairCountRDD.
  map{ case ((w1, w2), c) => (w1, (w2, c)) }.
  reduceByKey( (acc, x) =>
    if (x._2 > acc._2) (x._1, x._2) else acc
  ).
  map{ case (w1, (w2, c)) => ((w1, w2), c) }
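
The steps above can also be tried without a cluster, using ordinary Scala collections. This is a minimal sketch and not the answer's own code: the object name `WordPairs`, its method names and the sample sentence in the usage note are made up for illustration, and `groupBy` stands in for `reduceByKey`.

```scala
object WordPairs {
  // Strip punctuation, lowercase, and pair each word with its successor.
  def pairs(text: String): List[(String, String)] =
    text.split("""[\s,.;:!?]+""")
      .filter(_.nonEmpty)                        // drop empty tokens from leading separators
      .map(_.toLowerCase)
      .sliding(2)
      .collect { case Array(w1, w2) => (w1, w2) } // skips a lone trailing window
      .toList

  // Count occurrences of each distinct pair (local stand-in for reduceByKey).
  def pairCounts(text: String): Map[(String, String), Int] =
    pairs(text).groupBy(identity).map { case (pair, occurrences) => (pair, occurrences.size) }

  // For each first word, keep the follower with the highest count.
  // When two followers tie, which one is kept is unspecified.
  def maxFollower(text: String): Map[String, (String, Int)] =
    pairCounts(text)
      .groupBy { case ((w1, _), _) => w1 }
      .map { case (w1, counts) =>
        val ((_, w2), c) = counts.maxBy { case (_, n) => n }
        (w1, (w2, c))
      }
}
```

For example, `WordPairs.maxFollower("Basketball is fun. Basketball is hard, basketball players run.")("basketball")` yields `("is", 2)`, since "is" follows "basketball" twice and "players" once.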

[UPDATE]

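If, per the revised requirement, you only need the word pair counts sorted in descending order, you can skip step 4 and use sortBy on the word pair counts: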
wordPairCountRDD.
  sortBy( z => (z._2, z._1._1, z._1._2), ascending = false )
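
To see what that composite sort key does, here is a small local sketch on a plain `Map` instead of an RDD. The object name `PairSort` and the sample counts in the usage note are hypothetical; the sort key mirrors the one above, with a reversed `Ordering` playing the role of `ascending = false`.

```scala
object PairSort {
  // Sort pair counts by (count, first word, second word), all descending,
  // mirroring sortBy(..., ascending = false) on the RDD.
  def sortDesc(counts: Map[(String, String), Int]): List[((String, String), Int)] =
    counts.toList.sortBy { case ((w1, w2), c) => (c, w1, w2) }(
      Ordering[(Int, String, String)].reverse
    )
}
```

Because the whole key is reversed, pairs with equal counts come out in reverse alphabetical order: with counts `("basketball","is") -> 2`, `("is","fun") -> 1` and `("is","hard") -> 1`, the result lists `("basketball","is")` first, then `("is","hard")`, then `("is","fun")`.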

Well, my text uses "b" in place of "basketball", and "a" and "c" for the other words.
scala> val r = scala.util.Random 
scala> val s = (1 to 20).map (i => List("a", "b", "c")(r.nextInt (3))).mkString (" ")
s: String = c a c b a b a a b c a b b c c a b b c b

The result is obtained with split, sliding, filter, map, groupBy, map and sortBy:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2) 
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
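
Note that `groupBy (_(0))` groups by the first character of each word, which works here only because the sample words are single letters. For real words such as "basketball", grouping on the whole word is probably what is wanted. The following is a sketch under that assumption; the object name `FollowCount` and the method `followers` are made up for illustration, and ties are broken alphabetically so the output is deterministic.

```scala
object FollowCount {
  // Counts the words that directly follow `target` in a space-separated string,
  // most frequent first; equal counts are ordered alphabetically.
  def followers(s: String, target: String): List[(String, Int)] =
    s.split(" ")
      .sliding(2)
      .filter(pair => pair.length == 2 && pair(0) == target) // length check guards one-word input
      .map(_(1))
      .toList
      .groupBy(identity)                                     // group on the whole word, not its first Char
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toList
      .sortBy { case (word, n) => (-n, word) }
}
```

On the sample string above, `FollowCount.followers("c a c b a b a a b c a b b c c a b b c b", "b")` gives `List((c,3), (a,2), (b,2))`, the same counts as before but keyed by `String` rather than `Char`.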

In small steps, sliding:
scala> val counts = s.split (" ").sliding (2).toList
counts: List[Array[String]] = List(Array(c, a), Array(a, c), Array(c, b), Array(b, a), Array(a, b), Array(b, a), Array(a, a), Array(a, b), Array(b, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, b))

filter:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").toList
counts: List[Array[String]] = List(Array(b, a), Array(b, a), Array(b, c), Array(b, b), Array(b, c), Array(b, b), Array(b, c))

map (_(1)) (array access to the second element)

groupBy (_(0)):
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0))
counts: scala.collection.immutable.Map[Char,List[String]] = Map(b -> List(b, b), a -> List(a, a), c -> List(c, c, c))

Turn each list into its size:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}
counts: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 2, c -> 3)

Finally, sort in descending order:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2) 
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))

@Eee, please see my updated answer to your question, which covers the revised requirement.