Scala: parsing individual sentences out of a paragraph

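Tags: scala, parsing, nlp

I'm trying to build a parser that can turn a paragraph into a list of sentences, but I'm running into a major problem. I'm using the Stanford parser to pull out the sentences intelligently, but the issue is that the parser only stores the list of tokens rather than the sentence itself. This can become very problematic if my client wants the text exactly as it showed up before (including any spacing that was there before).

Does anyone have suggestions on how I can solve this?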



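import java.io.StringReader
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.Sentence
import edu.stanford.nlp.process.DocumentPreprocessor

def prepSentenceStrings(text: String): List[String] = {
  val mod = text.replace("Sr.", "Sr") // deals with an edge case
  val doc = new DocumentPreprocessor(new StringReader(mod))
  doc.asScala.map(x => reconfigureSentence(Sentence.listToString(x))).toList // middle of this line was garbled in the source; the map over sentences is a reconstruction
}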

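def reconfigureSentence(text: String): String = {
  text.replace(" .", ".").replace(" ,", ",").replace(" !", "!").replace("( ", "(").replace("< ", "<").replace(" )", ")")
}

The problem with using Stanford NLP for sentence splitting is that it first tokenizes the whole paragraph and, in the process, discards all whitespace characters. As far as I know, there is no way to reconstruct them, so there is always a chance of ending up with slightly altered sentences.

By contrast, the session below runs epic's epic.preprocess.SegmentSentences on the question text; its output keeps the original spacing, including the run of spaces after "turn a".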




fukaeri:epic dlwh (master)$ java -Xmx8g -cp target/scala-2.11/epic-assembly-0.4-SNAPSHOT.jar epic.preprocess.SegmentSentences
fukaeri:epic dlwh (master)$ vi qq.txt
fukaeri:epic dlwh (master)$ cat qq.txt
I'm trying to build a parser that can turn a          paragraph into a list of sentences, but I'm running into a major problem. So I'm using the stanford parser to pull out the sentences intelligently, but the issue is that the parser only stores the list of tokens, rather than the sentence itself. This can become very problematic if my client wants the text EXACTLY as it showed up before (including any spacing that was there before.
fukaeri:epic dlwh (master)$ java -Xmx8g -cp target/scala-2.11/epic-assembly-0.4-SNAPSHOT.jar epic.preprocess.SegmentSentences < qq.txt
I'm trying to build a parser that can turn a          paragraph into a list of sentences, but I'm running into a major problem.
So I'm using the stanford parser to pull out the sentences intelligently, but the issue is that the parser only stores the list of tokens, rather than the sentence itself.
This can become very problematic if my client wants the text EXACTLY as it showed up before (including any spacing that was there before.
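
The whitespace-preserving behaviour seen in that output can be approximated without any NLP library by never rebuilding sentences from tokens: ask a splitter for character offsets only, then slice the original string at those offsets. The sketch below is not what epic or the Stanford parser does internally; it swaps in the JDK's java.text.BreakIterator as a stand-in splitter, and the names SentenceSlices and splitPreservingText are hypothetical, chosen only for illustration.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper, not part of epic or Stanford NLP.
object SentenceSlices {

  /** Splits `text` into sentences by slicing the ORIGINAL string at the
    * boundaries reported by BreakIterator, so no character is changed.
    */
  def splitPreservingText(text: String, locale: Locale = Locale.US): List[String] = {
    val it = BreakIterator.getSentenceInstance(locale)
    it.setText(text)

    // Boundaries are character offsets into `text`:
    // first() gives the start, next() advances until it returns DONE.
    val bounds = Iterator
      .iterate(it.first())(_ => it.next())
      .takeWhile(_ != BreakIterator.DONE)
      .toList

    // Each pair of consecutive boundaries delimits one sentence.
    bounds.zip(bounds.drop(1)).map { case (start, end) =>
      text.substring(start, end)
    }
  }
}
```

Because every slice is a substring of the input, joining the slices with mkString gives back the input unchanged, internal runs of spaces included. Whitespace between sentences stays attached to one of the slices, so trim only at display time if the client wants it removed. BreakIterator is rule-based and may split after abbreviations such as "Sr.", so a statistical segmenter is likely the better boundary source; the slicing step works the same with any splitter that reports character offsets.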
