Scala: counting the occurrences of a word in a file


The code below tries to count the number of times "Apple" appears in an HTML file.

object Question extends App {

  // Keeps each sentence part that contains at least one word from wordList.
  def validWords(fileSentancesPart: List[String], wordList: List[String]): List[Option[String]] =
    fileSentancesPart.map(sentancePart => {
      if (isWordContained(wordList, sentancePart)) {
        Some(sentancePart)
      } else {
        None
      }
    })

  // True if sentancePart contains any word from wordList.
  def isWordContained(wordList: List[String], sentancePart: String): Boolean = {
    for (word <- wordList) {
      if (sentancePart.contains(word)) {
        return true
      }
    }
    false
  }

  // Upper-case every line, then split it on single spaces.
  lazy val lines = scala.io.Source.fromFile("c:\\data\\myfile.txt", "latin1").getLines().toList.map(m => m.toUpperCase.split(" ")).flatten

  val vw = validWords(lines, List("APPLE")).flatten.size

  println("size is " + vw)
}

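The HTML file being parsed by the code above (named "c:\data\myfile.txt"):

Any suggestions for alternatives to the above code are welcome.

I think my problem is the one described in @Jack Leow's comment. The code: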

  val fileWords = List("this", "is", "this appleisapple an", "applehere")
  // Renamed from validWords so the value does not clash with the validWords method.
  val wordList = List("apple")

  val l: List[String] = validWords(fileWords, wordList).flatten

  println("size : " + l.size)

This prints a size of 2 when it should be 3.

I think you should do the following:

def validWords(
  fileSentancesPart: List[String],
  wordList: List[String]): List[Option[String]] =

  fileSentancesPart /* add flatMap */ .flatMap(_.tails)
    .map(sentancePart => {
      if (isWordContained(wordList, sentancePart)) {
        Some(sentancePart)
      } else {
        None
      }
    })

def isWordContained(
  wordList: List[String],
  sentancePart: String): Boolean = {

  for (word <- wordList) {
    //if (sentancePart.contains(word)) {
    if (sentancePart.startsWith(word)) { // use startsWith
      return true;
    }
  }
  false
}
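To see why combining `tails` with `startsWith` counts every occurrence, here is a minimal self-contained sketch that compares both approaches on the question's sample data. The object name `TailsCount` and the helpers `countOccurrences` and `countContaining` are hypothetical names chosen for illustration; they are not part of the original code.

```scala
object TailsCount {

  // Hypothetical helper: counts every position, in every string of `parts`,
  // at which some word from `wordList` starts. Assumes the words are non-empty.
  def countOccurrences(parts: List[String], wordList: List[String]): Int =
    parts
      .flatMap(_.tails) // every suffix of every string
      .count(suffix => wordList.exists(word => suffix.startsWith(word)))

  // The question's approach for comparison: counts strings that contain
  // a word at least once, however many times the word occurs in them.
  def countContaining(parts: List[String], wordList: List[String]): Int =
    parts.count(part => wordList.exists(word => part.contains(word)))

  def main(args: Array[String]): Unit = {
    val fileWords = List("this", "is", "this appleisapple an", "applehere")
    println("occurrences : " + countOccurrences(fileWords, List("apple")))
    println("containing  : " + countContaining(fileWords, List("apple")))
  }
}
```

Each suffix corresponds to one start position in the string, so a string that contains the word twice contributes two matching suffixes. The trade-off is that every string is expanded into all of its suffixes, which may be slow for very long lines.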

You can use a regular expression together with the Source iterator:
import scala.io.Source
val regex = "([Aa]pple)".r
val count = Source.fromFile("/test.txt").getLines.map(regex.findAllIn(_).length).sum
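The same idea can be tried without touching the file system by passing the lines in as an iterator. The sketch below is only an illustration: the object name `RegexCount`, the helper `countMatches`, and the case-insensitive pattern `(?i)apple` are assumptions, not part of the answer above.

```scala
import scala.util.matching.Regex

object RegexCount {

  // Assumed pattern: the (?i) flag makes the match case-insensitive,
  // so "apple", "Apple" and "APPLE" are all counted.
  val applePattern: Regex = "(?i)apple".r

  // Hypothetical helper: sums the non-overlapping matches found on each line.
  def countMatches(lines: Iterator[String], pattern: Regex): Int =
    lines.map(line => pattern.findAllIn(line).size).sum

  def main(args: Array[String]): Unit = {
    val lines = Iterator("this is", "this Appleisapple an", "APPLEhere")
    println("count : " + countMatches(lines, applePattern))
  }
}
```

For a real file, the iterator could come from `Source.fromFile(path, "latin1").getLines()`, in which case the caller should also close the `Source` once counting is done.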

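Just from a quick look at the code: if a line in the file contains the word "APPLE" twice, how is that counted?

@JackLeow I think that is indeed my problem; please see the update.

"(?i)apple".r findAllIn "this apple ..." size, in case you are looking for another approach.

s/sentance/sentence/g :)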