Ruby: splitting text into sentences

ruby regex

Following a tutorial in a book, I am using the following code to split text into sentences:

def sentences
    gsub(/\n|\r/, ' ').split(/\.\s*/)
end

It works, but not when there is a newline that is not preceded by a period, for example:

Hello. two line sentence
and heres the new line

There is a "\t" at the start of each sentence. So if I call the method on the text above, I get

["Hello." "two line sentence /tand heres the new line"]

Any help would be much appreciated! Thanks.

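The best way to split text into sentences is to use Stanford CoreNLP. With the sample method given in the question, any abbreviation or name prefix (such as "Mr.") would also be split.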
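To see the abbreviation problem concretely, here is a minimal sketch that applies the split pattern from the question to a made-up sample string. The sample text is an assumption for illustration and does not come from the original post.

```ruby
# Same split pattern as the question's method, applied to a made-up sample.
text = "Mr. Smith arrived. He sat down."

parts = text.split(/\.\s*/)

# The period after "Mr" is treated as a sentence boundary,
# so the name prefix ends up as its own "sentence".
p parts
```

This prints `["Mr", "Smith arrived", "He sat down"]`: three pieces for what a reader would call two sentences.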

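The stanford-core-nlp RubyGem provides a Ruby interface. See its installation instructions; you can then write code like the following: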

require "stanford-core-nlp"

StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

pipeline =  StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Hello. two line sentence
and heres the new line'
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each{|s| puts "sentence: " + s.to_s}

#output:
#sentence: Hello.
#sentence: two line sentence
#and heres the new line

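I'm not clear on what you are asking. What exactly are you trying to do, and what goes wrong?

The method is supposed to split text into sentences on a period followed by whitespace. So calling .sentences on the lines above should give ["Hello", "two line sentence and heres the new line"], but when there is a new line I get a /t. That's basically it.

I think the root of the problem may be that the tab character is already in the text. You could use the more aggressive

gsub(/\s+/, ' ')

to avoid the problem, since \s also matches tabs, whereas the original pattern only replaces \n and \r.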
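Combining this suggestion with the method from the question, a minimal sketch might look like the following. It assumes, as the original snippet implies, that `sentences` is defined on `String`. The `strip` call and the sample text with a tab-indented second line are additions for illustration.

```ruby
class String
  # Collapse every run of whitespace (newlines, tabs, spaces) into a single
  # space, then split on a period followed by optional whitespace.
  def sentences
    gsub(/\s+/, ' ').strip.split(/\.\s*/)
  end
end

# Sample input: the second line starts with a tab, as in the question.
text = "Hello. two line sentence\n\tand heres the new line"

p text.sentences
```

This prints `["Hello", "two line sentence and heres the new line"]`. The approach still splits on every period, so abbreviations such as "Mr." remain a limitation of the regex-based method.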