Scala/Java - Library to parse some text and remove punctuation?


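I have an implementation in Java that uses BreakIterator to remove punctuation from a string. I need to rewrite it in Scala, so I figured this might be a good opportunity to replace it with a better library (my implementation is very naive, and I'm sure it fails on edge cases).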

Is there a library that might be useful for this?

Edit: here is my quick solution in Scala:

  private val getWordsFromLine = (line: String) => {
    line.split(" ")
        .map(_.toLowerCase())
        .map(word => word.filter(Character.isLetter(_)))
        .filter(_.length() > 1)
        .toList
  }

Given this List[String] (one per line... yes, it's the Bible; it makes a good test case):

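THE SECOND BOOK OF MOSES, CALLED EXODUS

CHAPTER 1 1 Now these [are] the names of the children of Israel, which came into Egypt; every man and his household came with Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun, and Benjamin, 4 Dan, and Naphtali, Gad, and Asher.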

you get a List[String] like this:

List(the, second, book, of, moses, called, exodus, chapter, now, these, are, the, names, of, the, children, of, israel, which, came, into, egypt, every, man, and, his, household, came, with, jacob, reuben, simeon, levi, and, judah, issachar, zebulun, and, benjamin, dan, and, naphtali, gad, and, asher)
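The edge cases the question worries about are easy to reproduce. The sketch below wraps the same logic in an object (the name NaiveWords is only for illustration) and feeds it a few inputs. Because non-letters are deleted rather than treated as separators, hyphenated words, contractions, and words joined by punctuation get glued together.

```scala
object NaiveWords {
  // Same logic as the quick solution above, wrapped so it can run on its own.
  def getWordsFromLine(line: String): List[String] =
    line.split(" ")
      .map(_.toLowerCase)
      .map(word => word.filter(c => Character.isLetter(c)))
      .filter(_.length > 1)
      .toList

  def main(args: Array[String]): Unit = {
    println(getWordsFromLine("a well-known man")) // List(wellknown, man)
    println(getWordsFromLine("don't stop"))        // List(dont, stop)
    println(getWordsFromLine("came;every man"))    // List(cameevery, man)
  }
}
```

Whether "wellknown" or "cameevery" is acceptable depends on what the word list is used for; splitting on non-letters instead of deleting them avoids the second case.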

Here is an approach using regular expressions. It doesn't filter out single-character words yet, though.

val s = """
THE SECOND BOOK OF MOSES, CALLED EXODUS

CHAPTER 1 1 Now these [are] the names of the children of Israel,
which came into Egypt; every man and his household came with
Jacob. 2 Reuben, Simeon, Levi, and Judah, 3 Issachar, Zebulun,
and Benjamin, 4 Dan, and Naphtali, Gad, and Asher.
"""

/* \p{L} denotes Unicode letters */
var items = """\b\p{L}+\b""".r findAllIn s

println(items.toList)
  /* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS,
          CHAPTER, Now, these, are, the, names, of, the,
          children, of, Israel, which, came, into, Egypt,
          every, man, and, his, household, came, with,
          Jacob, Reuben, Simeon, Levi, and, Judah,
          Issachar, Zebulun, and, Benjamin, Dan, and,
          Naphtali, Gad, and, Asher)
  */

/* \w denotes word characters */
items = """\b\w+\b""".r findAllIn s
println(items.toList)
  /* List(THE, SECOND, BOOK, OF, MOSES, CALLED, EXODUS,
          CHAPTER, 1, 1, Now, these, are, the, names, of,
          the, children, of, Israel, which, came, into,
          Egypt, every, man, and, his, household, came,
          with, Jacob, 2, Reuben, Simeon, Levi, and, Judah,
          3, Issachar, Zebulun, and, Benjamin, 4, Dan, and,
          Naphtali, Gad, and, Asher)
  */

The word boundary \b is described in the Javadoc for regular expressions (java.util.regex.Pattern).
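The regex approach above can be extended to lowercase the matches and drop single-character words, which is what the question's own solution does. A minimal sketch, reusing the same \b\p{L}+\b pattern; the object and method names are made up for this example.

```scala
object RegexWords {
  // A run of Unicode letters delimited by word boundaries,
  // so digits and punctuation act as separators.
  private val Word = """\b\p{L}+\b""".r

  def words(text: String): List[String] =
    Word.findAllIn(text)
      .map(_.toLowerCase)
      .filter(_.length > 1) // drop single-character words such as "a"
      .toList

  def main(args: Array[String]): Unit =
    println(words("CHAPTER 1 1 Now these [are] the names; a man."))
    // List(chapter, now, these, are, the, names, man)
}
```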

For this particular case, I would use a regular expression:

def toWords(lines: List[String]) = lines flatMap { line =>
  "[a-zA-Z]+".r findAllIn line map (_.toLowerCase)
}
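One thing to keep in mind with [a-zA-Z]+ is that it only covers ASCII letters, so accented or non-Latin words are cut apart. That is fine for the English text in the question, but a small comparison against \p{L}+ shows the difference. This is only an illustration; the object and method names are hypothetical.

```scala
object AsciiVsUnicode {
  // ASCII-only letters, as in toWords above.
  def asciiWords(line: String): List[String] =
    "[a-zA-Z]+".r.findAllIn(line).map(_.toLowerCase).toList

  // Any Unicode letter.
  def unicodeWords(line: String): List[String] =
    """\p{L}+""".r.findAllIn(line).map(_.toLowerCase).toList

  def main(args: Array[String]): Unit = {
    // Escapes keep the accented letters precomposed regardless of file encoding.
    val line = "Caf\u00e9, na\u00efve!"
    println(asciiWords(line))   // List(caf, na, ve)
    println(unicodeWords(line)) // List(café, naïve)
  }
}
```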

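Why not use the Java implementation from Scala? The two are interoperable. You could still add some Scala features around the Java API to make it easier to use.

I just don't want to rewrite it if I don't have to.

It would help if you provided an example of what you are looking for. Based on your description so far, I think a regular expression should do the job.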
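As for keeping the Java API and wrapping it in Scala, here is a rough sketch of what that could look like with java.text.BreakIterator. It is not the asker's original implementation, which is not shown; the object name, method name, and default locale are assumptions made for the example.

```scala
import java.text.BreakIterator
import java.util.Locale

object BreakIteratorWords {
  // Walks the word boundaries reported by BreakIterator and keeps only
  // the segments that contain at least one letter.
  def words(text: String, locale: Locale = Locale.ENGLISH): List[String] = {
    val it = BreakIterator.getWordInstance(locale)
    it.setText(text)
    val out = List.newBuilder[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      val segment = text.substring(start, end)
      if (segment.exists(c => Character.isLetter(c))) out += segment.toLowerCase(locale)
      start = end
      end = it.next()
    }
    out.result()
  }

  def main(args: Array[String]): Unit =
    println(words("came with Jacob, and Levi"))
    // List(came, with, jacob, and, levi)
}
```

Compared with a regex, this leaves the decision about where words start and end to the locale-aware rules built into the JDK, at the cost of a more verbose loop.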