Java OpenNLP sentence detector training example


I am trying to train a new model using the example from the manual on the official OpenNLP website. Here is the example:


    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    SentenceModel model;
    try {
      model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null)
        modelOut.close();
    }

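The problem is on the second line:

    ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);

The documentation tells me: Deprecated. Use PlainTextByLineStream(InputStreamFactory, Charset) instead. But I don't know how to use this constructor. I would like an example that uses this non-deprecated constructor with the same corpus file.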

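I have written the following code based on the OpenNLP documentation. It shows both ways of calling the train method, the deprecated one and the one the documentation suggests: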

    Charset charset = Charset.forName("UTF-8");
    InputStreamFactory inputStreamFactory=null;
    ObjectStream<String> lineStream=null;
    ObjectStream<SentenceSample> sampleStream=null;
    SentenceModel model=null;
    OutputStream modelOut = null;
    try{
        inputStreamFactory=InputStreamFactory.class.newInstance();
        lineStream=new PlainTextByLineStream(inputStreamFactory,charset);
        sampleStream = new SentenceSampleStream(lineStream);
        //The deprecated:
        model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
        //The suggested:
        model = SentenceDetectorME.train("en", sampleStream, new SentenceDetectorFactory(), new TrainingParameters()); 
    } catch (InstantiationException e2){
        e2.printStackTrace();
    } catch (IllegalAccessException e2){
        e2.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    }finally {
        try{
            sampleStream.close();
        } catch (IOException e){
            e.printStackTrace();
        }
    }
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(new File("modelFile")));
        model.serialize(modelOut);
    } catch (FileNotFoundException e){
        e.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    } finally {
        if (modelOut != null) try{
            modelOut.close();
        } catch (IOException e){
            e.printStackTrace();
        }      
    }

But in this new code I don't know where to supply the corpus data file.
Any ideas?

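You have to initialize inputStreamFactory with the desired data file. InputStreamFactory is an interface, so InputStreamFactory.class.newInstance() cannot create an instance of it. Use a concrete implementation instead:

    inputStreamFactory = new MarkableFileInputStreamFactory(
            new File("en-sent.train"));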
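
To illustrate why the constructor takes a factory rather than an already opened stream, here is a minimal sketch that needs only the JDK. It assumes that OpenNLP's InputStreamFactory declares a single createInputStream() method, and it uses a local stand-in interface with that shape. The names StreamFactoryDemo, StreamFactory, forFile and readAllLines are hypothetical and are not part of OpenNLP. The point is that a factory can open the corpus file again whenever a fresh pass over the data is needed, which a single FileInputStream cannot do.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StreamFactoryDemo {

    // Hypothetical stand-in (NOT the OpenNLP type): assumed to have the same
    // shape as opennlp.tools.util.InputStreamFactory.
    interface StreamFactory {
        InputStream createInputStream() throws IOException;
    }

    // Returns a factory that opens a new stream on the file at every call.
    static StreamFactory forFile(File file) {
        return () -> new FileInputStream(file);
    }

    // Reads every line from a freshly created stream and closes it afterwards.
    static List<String> readAllLines(StreamFactory factory, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(factory.createInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Temporary two-line corpus, one sentence per line.
        File corpus = File.createTempFile("en-sent", ".train");
        corpus.deleteOnExit();
        Files.write(corpus.toPath(),
                Arrays.asList("First sentence.", "Second sentence."),
                StandardCharsets.UTF_8);

        StreamFactory factory = forFile(corpus);
        // Two independent passes over the same file through one factory.
        List<String> firstPass = readAllLines(factory, StandardCharsets.UTF_8);
        List<String> secondPass = readAllLines(factory, StandardCharsets.UTF_8);
        System.out.println(firstPass.size() + " lines, passes equal: " + firstPass.equals(secondPass));
    }
}
```

In the question's code, the MarkableFileInputStreamFactory from the answer plays the role of forFile here: it is constructed once with the corpus file and then passed to PlainTextByLineStream together with the charset.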