Java OpenNLP sentence training example

I am trying to train a new model using the example from the official OpenNLP manual. Here is the example:


    Charset charset = Charset.forName("UTF-8");
    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream sampleStream = new SentenceSampleStream(lineStream);
    SentenceModel model;
    try {
      model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null) 
      modelOut.close();
    }

The problem is on the second line:

    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);

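The Javadoc tells me: Deprecated. Use PlainTextByLineStream(InputStreamFactory, Charset) instead. But I don't know how to use this constructor. I would like an example that uses this non-deprecated constructor with the same corpus file.

I have written the following code, based on the OpenNLP documentation, showing both ways of calling the train method: the deprecated one and the one the documentation suggests: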

    Charset charset = Charset.forName("UTF-8");
    InputStreamFactory inputStreamFactory=null;
    ObjectStream<String> lineStream=null;
    ObjectStream<SentenceSample> sampleStream=null;
    SentenceModel model=null;
    OutputStream modelOut = null;
    try{
        inputStreamFactory=InputStreamFactory.class.newInstance();
        lineStream=new PlainTextByLineStream(inputStreamFactory,charset);
        sampleStream = new SentenceSampleStream(lineStream);
        //The deprecated:
        model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
        //The suggested:
        model = SentenceDetectorME.train("en", sampleStream, new SentenceDetectorFactory(), new TrainingParameters()); 
    } catch (InstantiationException e2){
        e2.printStackTrace();
    } catch (IllegalAccessException e2){
        e2.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    }finally {
        try{
            sampleStream.close();
        } catch (IOException e){
            e.printStackTrace();
        }
    }
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(new File("modelFile")));
        model.serialize(modelOut);
    } catch (FileNotFoundException e){
        e.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    } finally {
        if (modelOut != null) try{
            modelOut.close();
        } catch (IOException e){
            e.printStackTrace();
        }      
    }

But in this new code I don't know where to supply the corpus data file.

Any ideas?

You have to initialize inputStreamFactory with the data file you need, using:

    inputStreamFactory = new MarkableFileInputStreamFactory(
            new File("en-sent.train"));
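For background on why the constructor wants a factory instead of an already opened stream: as far as I understand, a factory lets the reader open a fresh stream whenever the data has to be read again from the start, which a single FileInputStream cannot do. OpenNLP's InputStreamFactory declares one method, createInputStream(), so it cannot be created with newInstance(); it needs an implementation that knows which file to open. The sketch below uses only the JDK. StreamFactory, readAllLines and FactorySketch are hypothetical names made up for this illustration, with StreamFactory standing in for the OpenNLP interface. It is not OpenNLP code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FactorySketch {

    /** Hypothetical stand-in for opennlp.tools.util.InputStreamFactory (assumption: same single method). */
    @FunctionalInterface
    public interface StreamFactory {
        InputStream createInputStream() throws IOException;
    }

    /** Reads every line by asking the factory for a brand-new stream on each call. */
    public static List<String> readAllLines(StreamFactory factory, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(factory.createInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Temporary file playing the role of the corpus; the contents are made up.
        Path corpus = Files.createTempFile("en-sent", ".train");
        Files.write(corpus, Arrays.asList("First sentence.", "Second sentence."), StandardCharsets.UTF_8);

        // The factory captures WHICH file to open, not an open stream.
        StreamFactory factory = () -> Files.newInputStream(corpus);

        // Two independent passes over the same data, each with its own stream.
        List<String> firstPass = readAllLines(factory, StandardCharsets.UTF_8);
        List<String> secondPass = readAllLines(factory, StandardCharsets.UTF_8);
        System.out.println(firstPass.size() + " lines, repeatable: " + firstPass.equals(secondPass));

        Files.delete(corpus);
    }
}
```

In the question's code the same idea applies: the object passed to PlainTextByLineStream has to carry the path to en-sent.train, which is what MarkableFileInputStreamFactory does with the File it receives.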