Extracting a linguistic structure from POS-tagged sentences using Stanford NLP in Java


java nlp stanford-nlp


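I am new to natural language processing (NLP). I want to do part-of-speech (POS) tagging and then find a specific structure in the text. I can manage the POS tagging with Stanford NLP, but I don't know how to extract this structure: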

NN/NNS+IN+DT+NN/NNS/NNP/NNPS

public static void main(String args[]) throws Exception {
    // Input file
    String contentFilePath = "";
    // Output file
    String triplesFilePath = contentFilePath.substring(0, contentFilePath.length() - 4) + "_postag.txt";
    // Document to POS-tag
    String content = getFileContent(contentFilePath);
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Annotate the document.
    Annotation doc = new Annotation(content);
    pipeline.annotate(doc);
    // Read the sentences of the annotated document.
    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            // This is the POS tag of the token
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            System.out.println(word + "/" + pos);
        }
    }
}

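You can simply iterate over your sentences and check the POS tags. If they match your requirements, you can extract the structure. The code could look like this: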

for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    // Slide a four-token window over the sentence.
    for (int i = 0; i < tokens.size() - 3; i++) {
        String pos = tokens.get(i).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (pos.equals("NN") || pos.equals("NNS")) {
            pos = tokens.get(i + 1).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("IN")) {
                pos = tokens.get(i + 2).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("DT")) {
                    pos = tokens.get(i + 3).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // Accepts NN, NNS, NNP and NNPS
                    if (pos.startsWith("NN")) {
                        // We have a match starting at index i and ending at index i + 3
                        String word1 = tokens.get(i).getString(CoreAnnotations.TextAnnotation.class);
                        String word2 = tokens.get(i + 1).getString(CoreAnnotations.TextAnnotation.class);
                        String word3 = tokens.get(i + 2).getString(CoreAnnotations.TextAnnotation.class);
                        String word4 = tokens.get(i + 3).getString(CoreAnnotations.TextAnnotation.class);
                        System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                    }
                }
            }
        }
    }
}
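If the pattern is likely to change, the same sliding-window check can be written with the pattern as data instead of nested if statements. The following is only a sketch: PosPatternMatcher, findMatches and NOUN_PREP_DET_NOUN are hypothetical names, not part of Stanford CoreNLP, and the sketch assumes the words and tags of one sentence have already been copied into two parallel lists of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; it is not part of Stanford CoreNLP.
class PosPatternMatcher {

    // One set of accepted tags per position: NN/NNS + IN + DT + NN/NNS/NNP/NNPS.
    static final List<Set<String>> NOUN_PREP_DET_NOUN = List.of(
            Set.of("NN", "NNS"),
            Set.of("IN"),
            Set.of("DT"),
            Set.of("NN", "NNS", "NNP", "NNPS"));

    // Returns the words of every window whose tags match the pattern position by position.
    static List<String> findMatches(List<String> words, List<String> tags,
                                    List<Set<String>> pattern) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> matches = new ArrayList<>();
        for (int start = 0; start + pattern.size() <= tags.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < pattern.size() && ok; k++) {
                String tag = tags.get(start + k);
                // A missing tag never matches.
                ok = tag != null && pattern.get(k).contains(tag);
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(start, start + pattern.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-written tags for illustration; in practice they would come from the tagger.
        List<String> words = List.of("The", "owner", "of", "the", "house", "left");
        List<String> tags = List.of("DT", "NN", "IN", "DT", "NN", "VBD");
        System.out.println(findMatches(words, tags, NOUN_PREP_DET_NOUN)); // prints [owner of the house]
    }
}
```

With this layout, changing the structure to look for means editing only the list of tag sets, and the loop bound follows the pattern length automatically.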


I just realized that the POS tag for determiners is "DT", not "DET". I have corrected the code in my answer, and it now works. Stanford's POS tags are taken from the Penn Treebank tag set, which specifies the tag for determiners as "DT". There is no "DET" tag.