Sentence detector in OpenNLP 1.5 (Java)?


I currently have the following code:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectDemo {
  public static void main(String[] args) throws IOException {
    // Load the pre-trained English sentence model; the stream is closed automatically.
    SentenceModel sdModel;
    try (InputStream in = new FileInputStream(
        "opennlp/models/english/sentdetect/en-sent.bin")) {
      sdModel = new SentenceModel(in);
    }
    SentenceDetectorME mSD = new SentenceDetectorME(sdModel);
    // Note that some sentence-final periods are not followed by a space.
    String param = "This is a good senttence.I'm very happy. Who can tell me the truth.And go to school.";
    String[] sents = mSD.sentDetect(param);
    // Print one detected sentence per line.
    for (String sent : sents) {
      System.out.println(sent);
    }
  }
}

But I get the following result:

This is a good senttence.I'm very happy.
Who can tell me the truth.And go to school.

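Of course, this is not what I want: the detector does not split where a period is directly followed by the next sentence. How can I solve this problem? Thanks.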

Try using the language-specific sentence detector (opennlp.tools.lang.english.SentenceDetector).

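I don't think the sentence detection model that ships with OpenNLP suits your task, because it was trained on data with whitespace after sentence-final punctuation, which is pretty standard in English orthography. English sentence detectors are typically used to distinguish sentence-final punctuation from punctuation that appears mid-sentence in abbreviations, quotations, and so on. Your average sentence detector will expect some kind of whitespace between sentences in all cases.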

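If you want to use OpenNLP, I think the easiest solution is to preprocess your data and add a space wherever you detect a pattern like

[a-z][.?!][A-Z]

(This pattern clearly isn't sufficient on its own; it's just to give the idea.) Not many abbreviations have the form Nnnn.Nnnn or Nnnn?Nnnnn, so I'd bet you can get good results without anything fancier than regular expressions, but it depends on what your data looks like. Alternatively, you could use some kind of tokenizer with a custom model to find these cases.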
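As a rough illustration of the preprocessing idea, here is a minimal sketch that inserts a space wherever a lowercase letter, one of `.`, `?` or `!`, and an uppercase letter appear back to back. The class name `SentenceSpacing` and the method `insertMissingSpaces` are made up for this example, and the pattern is deliberately as naive as the one above, so it will also split strings such as "e.g.Foo" and will miss sentences that end in a digit or a quotation mark.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: normalizes text before sentence detection.
class SentenceSpacing {

    // Group 1: lowercase letter plus sentence-final punctuation; group 2: uppercase letter.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("([a-z][.?!])([A-Z])");

    // Returns the text with a single space inserted between the two groups.
    static String insertMissingSpaces(String text) {
        return MISSING_SPACE.matcher(text).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        String param = "This is a good senttence.I'm very happy. "
                + "Who can tell me the truth.And go to school.";
        System.out.println(insertMissingSpaces(param));
    }
}
```

The normalized string could then be handed to `mSD.sentDetect(...)` from the question in place of the raw input. Whether the detector then splits every sentence still depends on the model, so it is worth checking the output on a sample of your own data.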

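You could also train your own sentence detection model that doesn't require whitespace between sentences, but that looks tricky in OpenNLP. The training program they provide requires training data with one sentence per line, so there is no way to avoid having whitespace inserted between sentences.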

opennlp.tools.lang.english.SentenceDetector has the same problem.