Scala: minimum absolute distance between words in a text


I am trying to find the minimum distance between multiple words in a given text.

Suppose I have a string such as: "a b cat dog x y z n m p fox x dog a b cat"

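Find the minimum distance over all matches of the substrings: (fox, dog, cat)

The substrings occur more than once in this text, which gives two candidate windows:

  • One at the start:
cat - 4, dog - 8, fox - 24

dist = 24 - 4 = 20

  • One at the end of the string:
fox - 24, dog - 30, cat - 38

dist = 38 - 24 = 14

Min dist = 14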
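The arithmetic above can be checked with a small brute-force sketch. It assumes the example text reads exactly as in the `text` value below (a reconstruction chosen so that the start offsets agree with the ones listed), and the names `WorkedExample`, `offsets` and `minDistance` are made up for illustration. It tries every combination of one occurrence per word, so it is only meant for tiny inputs like this one.

```scala
object WorkedExample {
  // Assumed example text; its start offsets match the ones listed above.
  val text = "a b cat dog x y z n m p fox x dog a b cat"
  val words = List("cat", "dog", "fox")

  // All start offsets of `word` in `text` (plain substring search).
  def offsets(word: String): List[Int] =
    Iterator.iterate(text.indexOf(word))(i => text.indexOf(word, i + 1))
      .takeWhile(_ >= 0)
      .toList

  // Brute force: pick one occurrence of every word in all possible ways
  // and keep the smallest spread between the lowest and highest offset.
  def minDistance: Option[Int] = {
    val combos = words.map(offsets).foldLeft(List(List.empty[Int])) { (acc, offs) =>
      for (combo <- acc; o <- offs) yield o :: combo
    }
    combos.map(c => c.max - c.min).reduceOption(_ min _)
  }
}
```

Under that assumption, `WorkedExample.offsets("dog")` gives `List(8, 30)` and `WorkedExample.minDistance` gives `Some(14)`, in line with the two windows worked out by hand.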

This is the algorithm I came up with:

import scala.collection.mutable.ListBuffer

object MinKWindowSum {
  def main(args: Array[String]): Unit = {   

    val document =
      """This Hello World is a huge text with thousands
Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features
Java Scala AXVX TXZX ASDQWE OWEQ World asb eere qwerer
asdasd Scala Java Hello docs World KLKM NWQEW ZXCASD OPOOIK Scala ASDSA
"""
    println(getMinWindowSize(document, "Hello World Scala"))
  }   

  def getMinWindowSize(str:String, s:String): Int = {

    /* creates a list of tuples List[(String, Int)] which contains each keyword and its
    respective index found in the text sorted in order by index.
    */
    val keywords = s.split(" ").toSet
    val idxs = keywords.map(k => (k -> ("(?i)\\Q" + k + "\\E").r.findAllMatchIn(str).map(_.start)))
      .map{ case (keyword,itr) => itr.map((keyword, _))}
      .flatMap(identity).toSeq
      .sortBy(_._2)

    // Calculates the min window on the next step.
    var min = Int.MaxValue
    var minI, minJ = -1

    // current window indexes and words
    var currIdxs = ListBuffer[Int]()
    var currWords = ListBuffer[String]()

    for(idx <- idxs ) {

      // check if word exists in window already
      val idxOfWord = currWords.indexOf(idx._1)

      if (!currWords.isEmpty && idxOfWord != -1) {
        currWords = currWords.drop(idxOfWord + 1)
        currIdxs = currIdxs.drop(idxOfWord + 1)
      }
      currWords += idx._1
      currIdxs += idx._2

      // if all keys are present check if it is new min window
      if (keywords.size == currWords.length) {
        val currMin = Math.abs(currIdxs.last - currIdxs.head)
        if (min > currMin) {
          min = currMin
          minI = currIdxs.head
          minJ = currIdxs.last
        }
      }
    }

    println("min = " + min + " ,i = " + minI + " j = " + minJ)
    min
  }

}
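In the example above we try to find the minimum distance between all matches of "Hello World Scala".

The shortest window is found between the indexes: i = 235, j = 257 --> min = 22

Is there a better way of doing this, either more idiomatic or better in terms of efficiency, scalability, readability and simplicity?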
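One possible direction, offered only as a sketch and not as the accepted way to do it: scan the matches once in text order and remember the most recent start offset of each keyword. Whenever every keyword has been seen, the candidate window runs from the oldest remembered offset to the current match. The names `MinWindowSketch` and `minWindow` are hypothetical, matching here is case-sensitive (unlike the `(?i)` pattern in the code above), and keywords are assumed not to overlap each other.

```scala
import scala.util.matching.Regex

object MinWindowSketch {
  /** Smallest distance between the first and last match start of any stretch of
    * `text` that contains every keyword, or None if some keyword never occurs. */
  def minWindow(text: String, keywords: Set[String]): Option[Int] =
    if (keywords.isEmpty) None
    else {
      // Quote each keyword so regex metacharacters are matched literally.
      val pattern: Regex = keywords.map(Regex.quote).mkString("|").r

      // Fold state: (latest start offset per keyword, best distance so far).
      val (_, best) = pattern.findAllMatchIn(text)
        .foldLeft((Map.empty[String, Int], Option.empty[Int])) {
          case ((lastSeen, bestSoFar), m) =>
            val seen = lastSeen.updated(m.matched, m.start)
            if (seen.size == keywords.size) {
              // Matches arrive in increasing order, so m.start is the window's end.
              val dist = m.start - seen.values.min
              (seen, Some(bestSoFar.fold(dist)(_ min dist)))
            } else (seen, bestSoFar)
        }
      best
    }
}
```

For the small example, `MinWindowSketch.minWindow("a b cat dog x y z n m p fox x dog a b cat", Set("fox", "dog", "cat"))` yields `Some(14)`. Each match costs one map update plus a minimum over at most one entry per keyword, so the work grows with the number of matches times the number of keywords.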

Here is a slightly "more functional" alternative:

val document =
  """This Hello World is a huge text with thousands Java of Hello words and Scala other lines and World and many other Hello docs
     Words of World in many langs Hello and features Java Scala AXVX TXZX ASDQWE OWEQ World
  """
val WORDS = Set("Hello", "World", "Scala")

val minDistance = document.trim
  .split(" ")
  .foldLeft(List[(String, Int)](), None: Option[Int], 0) {
    case ((words, min, idx), word) if WORDS.contains(word) =>
      val newWords = (word, idx) :: words.filter(_._1 != word)
      if (newWords.map(_._1).toSet == WORDS) { // toSet on only 3 elmts
        val idxes = newWords.map(_._2)
        val dist = idxes.max - idxes.min
        val newMin = min match {
          case None                    => dist
          case Some(min) if min < dist => min
          case _                       => dist
        }
        (newWords, Some(newMin), idx + word.length + 1)
      }
      else {
        (newWords, min, idx + word.length + 1)
      }
    case ((words, min, idx), word) =>
      (words, min, idx + word.length + 1)
  }
  ._2

println(minDistance)

My approach starts from a similar premise but uses a tail-recursive helper method to search the indexed words.

def getMinWindowSize(str :String, s :String) :Int = {
  val keywords = s.split("\\s+").toSet
  val re = "(?i)\\b(" + keywords.mkString("|") + ")\\b"
  val idxs = re.r.findAllMatchIn(str).map(w => w.start -> w.toString).toList

  def dist(input :List[(Int, String)], keys :Set[String]) :Option[Int] = input match {
    case Nil => None
    case (idx, word) :: rest =>
      if (keys(word) && keys.size == 1) Some(idx)
      else dist(rest, keys diff Set(word))
  }

  idxs.tails.collect{
    case (idx, word)::rest => dist(rest, keywords diff Set(word)).map(_ - idx)
  }.flatten.reduceOption(_ min _).getOrElse(-1)
}
No mutable variables or data structures. I also used Option to help return a more meaningful value when there is no minimum window.

Usage:

getMinWindowSize(document, "Hello World Scala")  //res0: Int = 22
getMinWindowSize(document, "Hello World Scal")   //res1: Int = -1
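If the -1 sentinel is not wanted, the same idea can expose the Option to the caller. The following is a sketch along those lines, not the answer's original code: `OptionWindow` and `minWindowOpt` are hypothetical names, matching is case-sensitive (the `(?i)` flag is dropped so that matched text always equals a keyword), and each keyword is quoted before it goes into the pattern.

```scala
import scala.annotation.tailrec
import scala.util.matching.Regex

object OptionWindow {
  def minWindowOpt(str: String, s: String): Option[Int] = {
    val keywords = s.split("\\s+").toSet
    val re = ("\\b(" + keywords.map(Regex.quote).mkString("|") + ")\\b").r
    // (start offset, matched word) pairs in text order.
    val idxs = re.findAllMatchIn(str).map(m => m.start -> m.matched).toList

    // Offset of the match that completes the remaining `keys`, if there is one.
    @tailrec
    def dist(input: List[(Int, String)], keys: Set[String]): Option[Int] = input match {
      case Nil => None
      case (idx, word) :: rest =>
        if (keys(word) && keys.size == 1) Some(idx)
        else dist(rest, keys - word)
    }

    // Try every match as a window start; keep the smallest completed window.
    idxs.tails.collect {
      case (idx, word) :: rest => dist(rest, keywords - word).map(_ - idx)
    }.flatten.reduceOption(_ min _)
  }
}
```

A caller can then decide what "no window" means, for example `OptionWindow.minWindowOpt(text, "fox dog cat").getOrElse(-1)`.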

I think something.foldLeft(List[…])((result,x)=>result:+f(x)) is also known as something.map(f). Also, something.foldLeft(List[…])((result,x)=>result++x) looks to me the same as something.flatMap(identity). For readability's sake, …

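The equivalences mentioned in the comment about foldLeft can be illustrated with a tiny sketch. The values `xs`, `nested` and the function `f` are made up for the illustration.

```scala
object FoldEquivalence {
  val xs = List(1, 2, 3)
  def f(x: Int): Int = x * 10

  // Appending f(x) to the accumulator rebuilds what map does directly.
  val viaFold: List[Int] = xs.foldLeft(List.empty[Int])((result, x) => result :+ f(x))
  val viaMap: List[Int] = xs.map(f)

  val nested = List(List(1, 2), List(3), Nil)

  // Concatenating each inner list onto the accumulator is a flatten.
  val flatViaFold: List[Int] = nested.foldLeft(List.empty[Int])((result, x) => result ++ x)
  val flatViaFlatMap: List[Int] = nested.flatMap(identity)
}
```

Both pairs produce equal lists. Besides being shorter, `map` and `flatMap` avoid the repeated `:+` on a `List`, which copies the accumulator on every step.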