Replacing bigrams according to their frequency in Scala and Spark


I want to replace every bigram whose frequency count is greater than a threshold with the pattern
(word1.concat("-").concat(word2))
and this is what I have tried:

import org.apache.spark.{SparkConf, SparkContext}

object replace {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")

    val sc = new SparkContext(conf)
    val rdd = sc.textFile("data/ddd.txt")

    val threshold = 2

    val searchBigram = rdd.map {
      _.split('.').map { substring =>
        // Trim each sentence, then tokenize on spaces
        substring.trim.split(' ')
          // Remove non-alphanumeric characters and convert to lowercase
          .map(_.replaceAll("""\W""", "").toLowerCase())
          .sliding(2)
      }.flatMap(identity)
        .map(_.mkString(" "))
        .groupBy(identity)
        .mapValues(_.size)
    }.flatMap(identity)
      .reduceByKey(_ + _)
      .collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x => x._1.split(' '))
      .map(x => (x(0), x(1)))
      .toVector


    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s => s.split(" ")   // split on spaces
      .sliding(2)                                 // take consecutive pairs
      .map { case Array(a, b) => (a, b) }
      .map(elem => if (searchBigram.contains(elem)) (elem._1.concat("-").concat(elem._2), " ") else elem)
      .map { case (e1, e2) => e1 }
      .mkString(" "))
    sample2.foreach(println)
  }
}
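The last-word loss in the code above can be reproduced without Spark: sliding(2) pairs each word with its successor, and the replacement keeps only the first element of each pair, so the final word of every line never survives. A minimal sketch with made-up input:

```scala
object SlidingDemo {
  def main(args: Array[String]): Unit = {
    val words = "cables volts cables finally".split(" ")
    // Keeping only the first element of every sliding pair drops "finally".
    // Using collect with a partial function also sidesteps the MatchError
    // that a bare `case Array(a, b)` in map would raise on one-word lines.
    val kept = words.sliding(2).collect { case Array(a, b) => a }.mkString(" ")
    println(kept) // cables volts cables
  }
}
```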

But this code removes the last word of each document, and it shows some errors when run on a file containing many documents.

Suppose my input file contains the following documents:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring two thousand issue moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

buffered lightning two thousand volts cables burned revivification place .

cables volts cables finally able hear auditory issue moody gem long rumored music .
My desired output is:

surprise heard thump opened door small-man clasping package wrapped.

upgrading system found review spring two-thousand issue-moody audio mortgage backed.

omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long small-man .

buffered lightning two-thousand volts-cables burned revivification place .

cables volts-cables finally able hear auditory issue-moody gem long rumored music .
Can anyone help me?

Spoonfeeding:

from operator import add  # used by reduceByKey below
# LocalSparkContext is the answerer's own wrapper around SparkContext/SQLContext

def getNgrams(sentence):
    # Return the list of adjacent word pairs in the sentence
    out = []
    sen = sentence.split(" ")
    for k in range(len(sen) - 1):
        out.append((sen[k], sen[k + 1]))
    return out
if __name__ == '__main__':

    try:
        lsc = LocalSparkContext.LocalSparkContext("Recommendation","spark://BigData:7077")
        sc = lsc.getBaseContext()
        ssc = lsc.getSQLContext()
        inFile = "bigramstxt.txt"
        sen = sc.textFile(inFile,1)
        v = 1
        brv = sc.broadcast(v)
        wordgroups = sen.flatMap(getNgrams).map(lambda t: (t,1)).reduceByKey(add).filter(lambda t: t[1]>brv.value)
        bigrams = wordgroups.collect()
        sc.stop()
        inp = open(inFile,'r').read()
        print inp
        for b in bigrams:
            print b
            inp = inp.replace(" ".join(b[0]),"-".join(b[0]))

        print inp

    except:
        sc.stop()
        raise
case class Bigram(first: String, second: String) {
  // Replace "first second" with "first-second" everywhere in s
  def mkReplacement(s: String) = s.replaceAll(first + " " + second, first + "-" + second)
}

 val data = List(
"surprise heard thump opened door small seedy man clasping package wrapped",
"upgrading system found review spring two thousand issue moody audio mortgage backed",
"omg left gotta wrap review order asap",
"understand issue moody hand delivered dali lama",
"speak hands wear earplugs lives . listen maintain link long",
"buffered lightning two thousand volts cables burned revivification place",
"cables volts cables finally able hear auditory issue moody gem long rumored music")

def stringToBigrams(s: String) = {
    val words = s.split(" ")
    if (words.size >= 2) {
      words.sliding(2).map(a => Bigram(a(0), a(1)))
    } else
      Iterator[Bigram]()
  }

val bigrams = data.flatMap { stringToBigrams }
//use reduceByKey rather than groupBy for Spark
val bigramCounts = bigrams.groupBy(identity).mapValues(_.size)

val threshold = 2
val topBigrams = bigramCounts.collect{case (b, c) if c >= threshold => b}

val replaced = data.map(r => 
      topBigrams.foldLeft(r)((r, b) => b.mkReplacement(r)))

replaced.foreach(println)
//> surprise heard thump opened door small seedy man clasping package wrapped
//| upgrading system found review spring two-thousand issue-moody audio mortgage backed
//| omg left gotta wrap review order asap
//| understand issue-moody hand delivered dali lama
//| speak hands wear earplugs lives . listen maintain link long
//| buffered lightning two-thousand volts-cables burned revivification place
//| cables volts-cables finally able hear auditory issue-moody gem long rumored music

Comments on the answer:

"...shows some errors when run on a file containing many documents." What errors?

scala.MatchError: [Ljava.lang.String;@6803a136 (of class [Ljava.lang.String;) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74) at replace$$anonfun$8$$anonfun$apply$7.apply(replace.scala:74)

At which line of your code? Also, the code (sort of) works for me: the most frequent bigrams do get replaced, but because of your algorithm the second word of a replaced pair still appears in the next sliding(2) entry, so "volts cables" becomes "volts-cables cables". Your replacement step therefore needs to change.

But it shows some errors for me and deletes the last words. Can you help me?

I think it is time for you to code and debug this yourself. Your algorithm cannot work as written: when the source contains "a b c" and (a, b) is one of the bigrams to replace, you consider (a, b) and replace it, then consider (b, c) and do not replace it, so you end up with "a-b b c". You cannot use sliding(2) for the replacement.

+1 for the Python, but a -1 for it not being the OP's language cancels that out.
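As the last comments point out, pairwise replacement over sliding(2) output leaves the shared word behind ("a-b b c"); substituting over the whole line instead, as mkReplacement in the answer above does, avoids this. A minimal sketch, where the list of frequent bigrams is assumed to be already computed:

```scala
object BigramReplace {
  // Replace each frequent bigram "a b" with "a-b" across the whole line,
  // so overlapping word pairs are never consulted individually.
  def replaceBigrams(line: String, bigrams: Seq[(String, String)]): String =
    bigrams.foldLeft(line) { case (acc, (a, b)) =>
      acc.replaceAll("\\b" + a + " " + b + "\\b", a + "-" + b)
    }

  def main(args: Array[String]): Unit = {
    val frequent = Seq(("volts", "cables"), ("issue", "moody"))
    val line = "cables volts cables finally able hear auditory issue moody gem"
    println(replaceBigrams(line, frequent))
    // cables volts-cables finally able hear auditory issue-moody gem
  }
}
```

The word-boundary anchors (\b) keep a bigram like "two thousand" from matching inside longer words; note this sketch does not handle bigrams containing regex metacharacters.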