MapReduce example in Scala
I have this problem as a homework assignment in Scala. The idea I had, but have not managed to implement, is: iterate over every word and, if the word is basketball, take the next word and add it to a map. Then reduce by key and sort from highest to lowest. Unfortunately, I don't know how to get the next word in a list of words. For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
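This is from:
As you can see, it counts occurrences of individual words, because the key-value pairs have the form (word, 1). Which part would you need to change to count combinations of words?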
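One way to read the hint above is to change the key from a single word to a pair made of a word and its successor. The following is a minimal sketch on plain Scala collections rather than on an RDD; the object and method names (`NextWordPairs`, `pairs`, `pairCounts`) are hypothetical and only serve the illustration.

```scala
object NextWordPairs {
  // Pair every word with its successor by zipping the list with itself shifted by one.
  // drop(1) is used instead of tail so that an empty list does not throw.
  def pairs(words: List[String]): List[(String, String)] =
    words.zip(words.drop(1))

  // Count (word, nextWord) pairs instead of single words.
  def pairCounts(words: List[String]): Map[(String, String), Int] =
    pairs(words)
      .groupBy(identity)
      .map { case (pair, occurrences) => (pair, occurrences.size) }
}
```

With a key of the form ((word, nextWord), 1) the same counting idea carries over to reduceByKey on an RDD.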
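This may help: for every first word, you can get the most frequent of the distinct word pairs in a few steps:
1. Strip punctuation and split the content into lowercased words
2. Use sliding(2) to create an array of word pairs
3. Use reduceByKey to count the occurrences of each distinct word pair
4. Use reduceByKey again to capture, for each first word, the word pair with the maximum count
import org.apache.spark.sql.functions._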
import org.apache.spark.mllib.rdd.RDDFunctions._
// Steps 1-3: split into lowercased words, build adjacent pairs, count each distinct pair
val wordPairCountRDD = sc.textFile("/path/to/textfile").
  flatMap( _.split("""[\s,.;:!?]+""") ).
  map( _.toLowerCase ).
  sliding(2).
  map{ case Array(w1, w2) => ((w1, w2), 1) }.
  reduceByKey( _ + _ )
// Step 4: re-key by the first word and keep the follower with the highest count
val wordPairMaxRDD = wordPairCountRDD.
  map{ case ((w1, w2), c) => (w1, (w2, c)) }.
  reduceByKey( (acc, x) =>
    if (x._2 > acc._2) (x._1, x._2) else acc
  ).
  map{ case (w1, (w2, c)) => ((w1, w2), c) }
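The four steps above can also be tried without a Spark cluster. The sketch below mirrors them on plain Scala collections, with groupBy standing in for reduceByKey; it is an illustration under that assumption, not the RDD code itself, and the names `WordPairSteps`, `pairCounts` and `maxPerFirstWord` are made up for the example.

```scala
object WordPairSteps {
  // Steps 1-3: lowercased words, adjacent pairs, count per distinct pair.
  def pairCounts(text: String): Map[(String, String), Int] = {
    val words = text.split("""[\s,.;:!?]+""").filter(_.nonEmpty).map(_.toLowerCase).toList
    words.sliding(2)
      .collect { case List(w1, w2) => (w1, w2) } // skips the short window of a one-word text
      .toList
      .groupBy(identity)
      .map { case (pair, occurrences) => (pair, occurrences.size) }
  }

  // Step 4: for each first word, keep the follower with the highest count.
  // On a tie, maxBy keeps whichever entry it meets first.
  def maxPerFirstWord(counts: Map[(String, String), Int]): Map[String, (String, Int)] =
    counts.toList
      .groupBy { case ((w1, _), _) => w1 }
      .map { case (w1, entries) =>
        val ((_, w2), c) = entries.maxBy { case (_, count) => count }
        (w1, (w2, c))
      }
}
```

For example, `WordPairSteps.maxPerFirstWord(WordPairSteps.pairCounts(text))("ball")` would return the word that most often follows "ball" in `text`, together with its count.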
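[Update]
If, per the revised requirement, you only need to sort the word-pair counts in descending order, you can skip step 4 and use sortBy on the word-pair counts: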
wordPairCountRDD.
sortBy( z => (z._2, z._1._1, z._1._2), ascending = false )
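As a small local illustration of sorting pair counts from highest to lowest, here is a sketch on a plain Map. Note one deliberate difference from the RDD call above: the sketch negates only the count, so ties are broken in ascending word order, whereas ascending = false reverses every component of the key. The name `SortPairCounts.sortedDesc` is hypothetical.

```scala
object SortPairCounts {
  // Highest count first; pairs with equal counts are ordered alphabetically.
  def sortedDesc(counts: Map[(String, String), Int]): List[((String, String), Int)] =
    counts.toList.sortBy { case ((w1, w2), c) => (-c, w1, w2) }
}
```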
Well, my text uses "b" in place of "basketball", and "a" and "c" for the other words.
scala> val r = scala.util.Random
scala> val s = (1 to 20).map (i => List("a", "b", "c")(r.nextInt (3))).mkString (" ")
s: String = c a c b a b a a b c a b b c c a b b c b
The result is obtained with split, sliding, filter, map, groupBy, map and sortBy:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
In small steps, starting with sliding:
scala> val counts = s.split (" ").sliding (2).toList
counts: List[Array[String]] = List(Array(c, a), Array(a, c), Array(c, b), Array(b, a), Array(a, b), Array(b, a), Array(a, a), Array(a, b), Array(b, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, b))
filter:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").toList
counts: List[Array[String]] = List(Array(b, a), Array(b, a), Array(b, c), Array(b, b), Array(b, c), Array(b, b), Array(b, c))
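map (_(1)) (array access to the second element)
groupBy (_(0)) (here _(0) is the first character of each word, which is enough because the sample words are single letters)
map, to replace each list with its size: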
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}
counts: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 2, c -> 3)
Finally, sort in descending order:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
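The chain above can be wrapped into one reusable function. The following sketch makes two assumptions that differ from the REPL session: it groups by the whole following word rather than by its first character, so it also works for words longer than one letter, and it breaks ties alphabetically so the order is deterministic. `Followers.followerCounts` is a hypothetical name.

```scala
object Followers {
  // For every occurrence of `target`, collect the word that follows it,
  // count each follower and sort by descending count (ties: alphabetical).
  def followerCounts(text: String, target: String): List[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).toList
      .sliding(2)
      .collect { case List(`target`, next) => next } // backticks match the value of `target`
      .toList
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toList
      .sortBy { case (word, count) => (-count, word) }
}
```

Called as `Followers.followerCounts(s, "b")` on the sample string `s` from the session, it yields the same counts as above, with the tie between "a" and "b" resolved alphabetically.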
@Eee, please see my updated answer to your question with the revised requirement.
The intermediate groupBy (_(0)) step yields:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0))
counts: scala.collection.immutable.Map[Char,List[String]] = Map(b -> List(b, b), a -> List(a, a), c -> List(c, c, c))