How do I use OpenNLP with Java?

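I want to POS-tag an English sentence and do some processing on the result. I would like to use OpenNLP, and I have it installed.

When I run the command

I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt

I hope this means it is installed correctly.

Now, how do I do this POS tagging from inside a Java application? I have added the opennlp-tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagger?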

Here is some (old) example code I collected, followed by more modern code:

package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        // Load the pre-trained English maxent POS model from the working directory.
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            // Split the line on whitespace, then tag each token.
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(tokens);

            // POSSample pairs tokens with tags; toString() prints them as word_TAG.
            POSSample sample = new POSSample(tokens, tags);
            System.out.println(sample.toString());

            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}

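This is basically adapted from the POSTaggerTool class included in OpenNLP. sample.getTags() is a String array that holds the tag types themselves.

This requires direct file access to the trained model file, which is really, really lame.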
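Once you have the parallel token and tag arrays, picking out nouns or adjectives is a matter of filtering by tag. The following is a minimal sketch that uses only the Java standard library, assuming the model emits Penn Treebank tags, where noun tags start with NN (NN, NNS, NNP, NNPS) and adjective tags start with JJ (JJ, JJR, JJS). The class name TagFilter, the method tokensWithTagPrefix, and the hard-coded arrays are hypothetical and stand in for real tagger output.

```java
import java.util.ArrayList;
import java.util.List;

public class TagFilter {

    /** Returns the tokens whose tag starts with any of the given prefixes. */
    public static List<String> tokensWithTagPrefix(String[] tokens, String[] tags, String... prefixes) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (String prefix : prefixes) {
                if (tags[i].startsWith(prefix)) {
                    matches.add(tokens[i]);
                    break; // avoid adding the same token twice
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical tagger output, hard-coded for illustration only.
        String[] tokens = {"The", "quick", "fox", "jumps"};
        String[] tags = {"DT", "JJ", "NN", "VBZ"};

        System.out.println("nouns: " + tokensWithTagPrefix(tokens, tags, "NN"));
        System.out.println("adjectives: " + tokensWithTagPrefix(tokens, tags, "JJ"));
    }
}
```

In a real program the two arrays would come from the tokenizer and from tagger.tag(...), as in the example above.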

An updated codebase for this is slightly different (and probably more useful).

First, the Maven POM:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

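This code doesn't actually test anything (it's a smoke test, if anything), but it should serve as a starting point. Another (potential) benefit is that it downloads the model for you if you haven't downloaded it already.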

The URL no longer works. I found it on slide 14 of

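The answers above do show how to use OpenNLP's existing models, but if you need to train your own model, the following may help:

Here is a detailed tutorial with complete code:

Depending on your domain, you can build the dataset automatically or manually. Building such a dataset by hand can be very painful; tools like this can help simplify the process

Training data format

The training data is passed as a text file in which each line is one data item. Each word in a line should be labeled in a format like "word_LABEL", with the word and the label name separated by an underscore '_'.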

anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
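To make the format concrete, here is a minimal sketch that splits one training line into words and labels using only the Java standard library. It is an independent illustration, not OpenNLP's own parser: the class WordTagLineParser and its TaggedLine holder are hypothetical names, and splitting at the last underscore is an assumption made here so that words which themselves contain underscores still parse.

```java
import java.util.ArrayList;
import java.util.List;

public class WordTagLineParser {

    /** Holds the parallel word and label lists parsed from one training line. */
    public static final class TaggedLine {
        public final List<String> words = new ArrayList<>();
        public final List<String> labels = new ArrayList<>();
    }

    /** Parses a whitespace-separated line of word_LABEL items. */
    public static TaggedLine parse(String line) {
        TaggedLine result = new TaggedLine();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank line
            }
            // Assumption: the label follows the last underscore in the item.
            int split = item.lastIndexOf('_');
            if (split <= 0 || split == item.length() - 1) {
                throw new IllegalArgumentException("Expected word_LABEL but got: " + item);
            }
            result.words.add(item.substring(0, split));
            result.labels.add(item.substring(split + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        TaggedLine parsed = parse("aoc_Brand 27\"_ScreenSize monitor_Category");
        // prints: [aoc, 27", monitor] => [Brand, ScreenSize, Category]
        System.out.println(parsed.words + " => " + parsed.labels);
    }
}
```

A check like this can be run over a hand-built dataset before training to catch items that are missing a label.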

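Train the model

The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to build the model. Below is the code that builds a model from a training data file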

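import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public POSModel train(String filepath) {
    TrainingParameters parameters = TrainingParameters.defaultParams();
    parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

    try (InputStream dataIn = new FileInputStream(filepath)) {
        // The factory hands back the single stream opened above, so the data can only be read once.
        ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
            @Override
            public InputStream createInputStream() throws IOException {
                return dataIn;
            }
        }, StandardCharsets.UTF_8);
        // Each line of word_LABEL items becomes one POSSample.
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

        return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    } catch (Exception e) {
        // Reading or training failed; report it and signal failure with null.
        e.printStackTrace();
        return null;
    }
}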

Use the model to do the tagging

Finally, here is how to use the model to tag an unseen query:

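import java.util.Arrays;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization; split once and reuse the result.
    String[] tokens = input.trim().split(" ");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns the best candidate tag sequences for the tokens.
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " =>" + tags);
    }
}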

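Thanks a lot, I'm finally on track! Can you tell me where I can find what NN, MD, VB and all these tags mean?

I don't know! I'm looking into that now, because I just realized, thanks to your question, how useful OpenNLP could be for my own tasks. :)

How do I pick out the nouns and adjectives from this output?

You should migrate your example code, because models should no longer be loaded via POSModelLoader (see the Javadoc). Instead, you can use the constructor POSModel(InputStream in) and load the model through an InputStream that references the actual model file. In addition, the class POSModelLoader is only present in early versions of OpenNLP.

With 1.6.0, the code as originally written actually runs fine, including the use of the constructor; it isn't even marked as deprecated (although PlainTextByLineStream is). Are you using a 1.6.0 snapshot? In any case, I updated the code to be more in line with 1.6. Thanks!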
