MapReduce example in Scala


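I have this problem to solve in Scala for homework. The idea I had, but have not been able to implement, is:

Iterate over each word; if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.

Unfortunately, I don't know how to get the next word in a list of words. For example, I would like to do something like this: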

val lines = spark.textFile("basketball_words_only.txt") // process lines in file

// split into individual words
val words = lines.flatMap(line => line.split(" "))

var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word

val it = Iterator(words)  

while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do    
}

val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer

val sort = count.sortBy(_._2,false,1)

val result2 = sort.collect()

for (i <- 0 to result2.length - 1) {
 printf("%s follows %d times\n", result2(i)._1, result2(i)._2);
}
This is from:

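As you can see, it counts occurrences of individual words, because the key-value pairs have the form (word, 1). Which part would you need to change to count combinations of words?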
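
One way to read this hint, sketched with plain Scala collections rather than Spark so that it runs without a cluster: only the key changes, the counting stays the same. The names `CountSketch`, `wordCounts`, and `pairCounts` are made up for this illustration and do not come from the linked example.

```scala
object CountSketch {
  // Classic word count: the key is a single word.
  def wordCounts(words: Seq[String]): Map[String, Int] =
    words.map(w => (w, 1))
      .groupBy(_._1)
      .map { case (w, ones) => (w, ones.map(_._2).sum) }

  // Pair count: the key becomes (word, nextWord); the counting is unchanged.
  def pairCounts(words: Seq[String]): Map[(String, String), Int] =
    words.zip(words.drop(1))          // pair each word with its successor
      .map(pair => (pair, 1))
      .groupBy(_._1)
      .map { case (pair, ones) => (pair, ones.map(_._2).sum) }
}
```

In Spark the `groupBy` plus `map` step would typically be a single `reduceByKey(_ + _)` on the keyed pairs.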


This might help you:

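You can get the max count for the first word in all distinct word pairs in a few steps:

  1. Strip punctuation and split the content into lowercased words
  2. Use sliding(2) to create an array of word pairs
  3. Use reduceByKey to count occurrences of distinct word pairs
  4. Use reduceByKey again to capture the word pair with the max count for each first word

Sample code is as follows: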

    import org.apache.spark.sql.functions._
    import org.apache.spark.mllib.rdd.RDDFunctions._  // provides sliding() on RDDs

    // Steps 1-3: tokenize, pair adjacent words, count each distinct pair
    val wordPairCountRDD = sc.textFile("/path/to/textfile").
      flatMap( _.split("""[\s,.;:!?]+""") ).
      map( _.toLowerCase ).
      sliding(2).
      map{ case Array(w1, w2) => ((w1, w2), 1) }.
      reduceByKey( _ + _ )

    // Step 4: for each first word, keep the following word with the highest count
    val wordPairMaxRDD = wordPairCountRDD.
      map{ case ((w1, w2), c) => (w1, (w2, c)) }.
      reduceByKey( (acc, x) =>
        if (x._2 > acc._2) x else acc
      ).
      map{ case (w1, (w2, c)) => ((w1, w2), c) }
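
To try the same four steps without a Spark cluster, here is a minimal sketch on plain Scala collections. It is an illustration only: `PairMaxSketch`, `pairCounts`, and `maxFollower` are hypothetical names, and `groupBy` stands in for `reduceByKey`, which only exists on pair RDDs.

```scala
object PairMaxSketch {
  // Steps 1-3: tokenize, pair adjacent words, count each distinct pair.
  def pairCounts(text: String): Map[(String, String), Int] =
    text.split("""[\s,.;:!?]+""").toList
      .filter(_.nonEmpty)                         // a leading delimiter yields an empty token
      .map(_.toLowerCase)
      .sliding(2)
      .collect { case List(w1, w2) => (w1, w2) }  // skips the lone window of a one-word text
      .toList
      .groupBy(identity)
      .map { case (pair, hits) => (pair, hits.size) }

  // Step 4: for each first word, keep the follower with the highest count.
  def maxFollower(counts: Map[(String, String), Int]): Map[String, (String, Int)] =
    counts.toList
      .map { case ((w1, w2), c) => (w1, (w2, c)) }
      .groupBy(_._1)
      .map { case (w1, xs) => (w1, xs.map(_._2).maxBy(_._2)) }
}
```

When two followers tie for the highest count, `maxBy` keeps whichever it encounters first, so the result for tied words should not be relied on.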
    
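[Update]

If, per the revised requirement, you only need to sort the word pair counts (in descending order), you can skip step 4 and use sortBy on the word pair counts: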

    wordPairCountRDD.
      sortBy( z => (z._2, z._1._1, z._1._2), ascending = false )
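
The same ordering can be checked locally on an ordinary collection. This is a sketch under the assumption that the counts have already been collected into a `Seq`; `SortSketch` and `sortDesc` are hypothetical names.

```scala
object SortSketch {
  // Sort pair counts by (count, first word, second word), all descending,
  // mirroring the tuple key used with ascending = false.
  def sortDesc(counts: Seq[((String, String), Int)]): Seq[((String, String), Int)] =
    counts.sortBy { case ((w1, w2), c) => (c, w1, w2) }(
      Ordering[(Int, String, String)].reverse
    )
}
```

Reversing the tuple ordering reverses every component, so ties on the count are broken by the words in reverse alphabetical order.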
    
Well, my text uses "b" in place of "basketball", and "a" and "c" in place of the other words:

    scala> val r = scala.util.Random 
    scala> val s = (1 to 20).map (i => List("a", "b", "c")(r.nextInt (3))).mkString (" ")
    s: String = c a c b a b a a b c a b b c c a b b c b
    
The result is obtained with split, sliding, filter, map, groupBy, map, and sortBy:

    scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2) 
    counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
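
Note that `groupBy (_(0))` keys on the first character of each word, which matches the word itself only because the sample words are single letters. For real words such as "basketball", a variant that groups by the whole word could look like the sketch below; `FollowerSketch` and `followers` are hypothetical names, and the tie-breaking rule is an assumption added for deterministic output.

```scala
object FollowerSketch {
  // Count which words directly follow `target`, most frequent first.
  def followers(text: String, target: String): List[(String, Int)] =
    text.split("\\s+").toList
      .sliding(2)
      .collect { case List(w, next) if w == target => next }
      .toList
      .groupBy(identity)                        // group by the whole word, not its first Char
      .map { case (word, hits) => (word, hits.size) }
      .toList
      .sortBy { case (word, n) => (-n, word) }  // count descending, ties alphabetical
}
```

Calling `FollowerSketch.followers(s, "b")` on the sample string above gives the same counts as the one-liner, with `String` keys instead of `Char` keys.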
    
In small steps, starting with sliding:

    scala> val counts = s.split (" ").sliding (2).toList
    counts: List[Array[String]] = List(Array(c, a), Array(a, c), Array(c, b), Array(b, a), Array(a, b), Array(b, a), Array(a, a), Array(a, b), Array(b, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, b))
    
filter:

    scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").toList
    counts: List[Array[String]] = List(Array(b, a), Array(b, a), Array(b, c), Array(b, b), Array(b, c), Array(b, b), Array(b, c))
    
map (_(1)) (array access to the second element), then groupBy (_(0)):

    scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0))
    counts: scala.collection.immutable.Map[Char,List[String]] = Map(b -> List(b, b), a -> List(a, a), c -> List(c, c, c))

Map each list to its size:

    scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}
    counts: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 2, c -> 3)

Finally, sort in descending order:

    scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
    counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))

@Eee, please see my updated answer to your question with the revised requirement.