Java: StanfordCoreNLP does not work the way I expect


java nlp stanford-nlp


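I use the code below. However, the result is not what I expected. The result is [machine, Learning], but I want to get [machine, learn]. How can I do that? Also, when my input is "biggest bigger", I want to get a result like [big, big], but the result is just [biggest bigger].

(Note: I only added these four jars in Eclipse: joda-time.jar, stanford-corenlp-3.3.1-models.jar, stanford-corenlp-3.3.1.jar, xom.jar. Do I need to add any more?)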


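Ideally, lemmatization returns the canonical form (known as the "lemma" or "headword") of a group of words. That canonical form, however, is not always what we intuitively expect. For example, you expect "learning" to yield the lemma "learn". But the noun "learning" has the lemma "learning", and only the present continuous verb "learning" has the lemma "learn". Where there is ambiguity, the lemmatizer has to rely on information from the part-of-speech tag.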
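To make the point about ambiguity concrete, here is a minimal sketch of a lemma lookup that depends on the part-of-speech tag. The class name PosAwareLemmaSketch and its two-entry dictionary are hypothetical and exist only for illustration; this is not how CoreNLP stores or computes lemmas.

```java
import java.util.HashMap;
import java.util.Map;

public class PosAwareLemmaSketch {
    // Hypothetical toy dictionary keyed by "word/TAG"; illustration only.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("learning/NN", "learning"); // noun: already a base form
        LEMMAS.put("learning/VBG", "learn");   // verb, -ing form
    }

    // Returns the toy lemma for a word given its part-of-speech tag;
    // unknown word/tag pairs fall back to the lowercased word.
    static String lemma(String word, String tag) {
        String lower = word.toLowerCase();
        return LEMMAS.getOrDefault(lower + "/" + tag, lower);
    }

    public static void main(String[] args) {
        System.out.println(lemma("learning", "NN"));  // learning
        System.out.println(lemma("learning", "VBG")); // learn
    }
}
```

The same surface string maps to two different lemmas, so the result is only as good as the tag the tagger assigns.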

OK, that explains "machine learning", but what about big, bigger and biggest?

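Lemmatization depends on morphological analysis. The Stanford Morphology class computes the base form of English words by removing only inflections, not derivational morphology. That is, it handles noun plurals, pronoun case and verb endings, but not things like comparative adjectives or derived nominals. It is based on a finite-state transducer by John Carroll et al., written in flex. I could not find the original version, but there seems to be a Java version.

That is why "biggest" does not yield "big".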

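The WordNet lexical database does resolve to the correct lemma, though. I usually use WordNet for lemmatization tasks and have not run into any major problems so far. There are also two other well-known tools that handle your examples correctly.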
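As a rough illustration of how a dictionary-based lemmatizer can map comparatives and superlatives to their base form, the sketch below combines an exception list for irregular forms with suffix rules that are checked against a known-word list. The class AdjectiveLemmaSketch, its three-word lexicon and its rule table are assumptions made for this example; it is written in the spirit of the WordNet approach and is not WordNet's actual code or data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AdjectiveLemmaSketch {
    // Hypothetical mini-lexicon standing in for a real dictionary.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("big", "large", "small"));

    // Forms that plain suffix stripping cannot recover
    // ("bigger" minus "er" gives "bigg", which is not a word).
    private static final Map<String, String> EXCEPTIONS = new HashMap<>();
    static {
        EXCEPTIONS.put("bigger", "big");
        EXCEPTIONS.put("biggest", "big");
    }

    // Suffix rewrite rules tried in order: {suffix, replacement}.
    private static final String[][] RULES = {
        {"est", ""}, {"er", ""}, {"est", "e"}, {"er", "e"}
    };

    // Returns the base form if one can be found, else the lowercased input.
    static String lemma(String adjective) {
        String word = adjective.toLowerCase();
        if (KNOWN.contains(word)) {
            return word;
        }
        String irregular = EXCEPTIONS.get(word);
        if (irregular != null) {
            return irregular;
        }
        for (String[] rule : RULES) {
            if (word.endsWith(rule[0])) {
                String stem = word.substring(0, word.length() - rule[0].length());
                String candidate = stem + rule[1];
                // Accept a candidate only if the dictionary knows it.
                if (KNOWN.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(lemma("biggest")); // big
        System.out.println(lemma("bigger"));  // big
        System.out.println(lemma("largest")); // large
        System.out.println(lemma("smaller")); // small
    }
}
```

Checking every candidate against the lexicon is what keeps the suffix rules from producing non-words.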


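I don't know why it shows [machine, Learning] instead of [machine, learning]. Why is the L still uppercase?

I guess the POS tag of "Learning" is "NNP" (proper noun), which is why it returns the capitalized word. Can you print the POS tags and check?

Sorry, I'm new here and don't know how to print POS tags. Can you tell me how?

token.get(PartOfSpeechAnnotation.class)

It shows NNP JJS JJR. I don't know what that means... Do you know how to change this so that "Learning" becomes "learn"?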
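For readers unsure what the tags printed above mean, here is a small lookup sketch, assuming the tagger uses the Penn Treebank tag set. The class PennTagGlossary is a hypothetical helper, not part of CoreNLP, and it lists only the tags mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PennTagGlossary {
    // A few Penn Treebank tags with their usual descriptions.
    private static final Map<String, String> TAGS = new LinkedHashMap<>();
    static {
        TAGS.put("NN", "noun, singular or mass");
        TAGS.put("NNP", "proper noun, singular");
        TAGS.put("JJR", "adjective, comparative");
        TAGS.put("JJS", "adjective, superlative");
        TAGS.put("VBG", "verb, gerund or present participle");
    }

    // Returns a description for the tag, or a placeholder if it is not listed.
    static String describe(String tag) {
        return TAGS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NNP", "JJS", "JJR"}) {
            System.out.println(tag + " = " + describe(tag));
        }
        // NNP = proper noun, singular
        // JJS = adjective, superlative
        // JJR = adjective, comparative
    }
}
```

Read this way, the output says that "Learning" was tagged as a proper noun, while the other two tokens were tagged as superlative and comparative adjectives.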

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");

        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    // Test
    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "Machine Learning\n";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}