Scala: How do I count all co-occurring elements in the lines of a document using Spark?


Suppose I have a document with a bunch of comma-separated phrases:

love, new, truck, present
environment, save, trying, stop, destroying
great, environment, save, money, animals, zoo, daughter, fun
impressive, loved, speech, inspiration
Happy Birthday, brother, years, old
save, money, stop, spending
new, haircut, love, check it out
Now I want to use Spark to count the number of co-occurring elements, so I would like to see:

{
  (love, new): 2, 
  (new, truck): 1, 
  (love, truck): 1, 
  (truck, present): 1, 
  (new, present): 1,
  (love, present): 1, 
  (great, environment): 1, 
  (environment, save): 2,
  (environment, trying): 1, 
  .... 
  (love, check it out): 1
}
Any suggestions?


I have currently created an RDD of the document (which I call phrase_list_RDD), and I know I can use phrase_list_RDD.flatMap(lambda line: line.split(",")) to parse the lines into elements, but I am having a hard time coming up with the last part to solve my problem. If anyone has any suggestions, I would appreciate it.
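
I suspect part of my problem is that flatMap flattens away the line boundaries that pairing needs. A minimal illustration on plain Scala collections (standing in for the RDD; the names here are just for illustration):

val lines = Seq("love, new, truck, present", "save, money, stop, spending")

// flatMap flattens everything into one bag of words - the grouping is lost:
lines.flatMap(_.split(",").map(_.trim))
// List(love, new, truck, present, save, money, stop, spending)

// map keeps one word list per line, which pair counting needs:
lines.map(_.split(",").map(_.trim).toList)
// List(List(love, new, truck, present), List(save, money, stop, spending))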

After getting the lines of text from the dataframe, you can split them and count the occurrences as follows:

import scala.collection.mutable

object CoOccurrence {

  val text = Seq("love, new, truck, present", "environment, save, trying, stop, destroying", "great, environment, save, money, animals, zoo, daughter, fun", "impressive, loved, speech, inspiration", "Happy Birthday, brother, years, old", "save, money, stop, spending", "new, haircut, love, check it out")

  def main(args: Array[String]): Unit = {
    // Map from a word pair to the number of lines in which both words appear
    val cooc = mutable.Map.empty[(String, String), Int]

    text.foreach { line =>
      // Split on commas, trim the whitespace, and sort so every pair
      // is keyed in one canonical order, e.g. (love, new), never (new, love)
      val words = line.split(",").map(_.trim).sorted
      val n = words.length
      // Visit each pair (i, j) with i < j exactly once
      for {
        i <- 0 until n - 1
        j <- (i + 1) until n
      } {
        val currentCount = cooc.getOrElseUpdate((words(i), words(j)), 0)
        cooc((words(i), words(j))) = currentCount + 1
      }
    }

    println(cooc)
  }
}
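
Running this locally prints the mutable map; for the sample input it contains entries such as (environment,save) -> 2 and (love,new) -> 2, matching the expected counts above.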
After the split (to which I added a trim to strip whitespace), you can use List.combinations(2) to get all combinations of two words. Passed into flatMap, this produces an RDD[List[String]], where each record is a list of size 2.
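
As a quick sanity check in plain Scala (no Spark needed), combinations(2) emits each pair of a line's words exactly once:

val words = "love, new, truck, present".split(",").map(_.trim).toList
words.combinations(2).foreach(println)
// List(love, new)
// List(love, truck)
// List(love, present)
// List(new, truck)
// List(new, present)
// List(truck, present)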

From there, it is a simple "word count":

import org.apache.spark.rdd.RDD

val result: RDD[(List[String], Int)] = phrase_list_RDD
  .map(_.split(",").map(_.trim).toList) // convert records to List[String]
  .flatMap(_.combinations(2))  // take all combinations of two words
  .map((_, 1))                 // prepare for reducing - starting with 1 for each combination
  .reduceByKey(_ + _)          // reduce

// result:
// ... 
// (List(environment, daughter),1)
// (List(save, daughter),1)
// (List(money, stop),1)
// (List(great, environment),1)
// (List(save, stop),2)
// ...
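
From the comments: "The resulting co-occurrence map in the first solution will not scale. It also cannot be translated as written into Spark to run at scale: getOrElseUpdate is not an operation that can be performed inside an RDD transformation." To which the first answerer replied: "You are right. I assumed they were more interested in the algorithm - but Spark provides high-level constructs that simplify the solution and allow it to scale. If that is the preferred behavior, I can delete my solution."

One caveat (an assumption on my part, since the sample lines happen to list shared words in the same order): combinations(2) preserves each line's word order, so a pair that appears in opposite orders on two different lines would produce two distinct keys. Sorting each line's words first, as the first solution does, normalizes the keys - a minimal sketch:

val sortedResult = phrase_list_RDD
  .map(_.split(",").map(_.trim).sorted.toList) // sorted, so (a, b) and (b, a) collapse into one key
  .flatMap(_.combinations(2))
  .map((_, 1))
  .reduceByKey(_ + _)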