Stanford NLP: getting output in the desired format with TokensRegex

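I am using TokensRegex for rule-based entity extraction. It works well, but I am having trouble getting the output in the format I want. The code snippet below gives me output for the following sentence:

Earlier this month, Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds Corolla cars for the U.S. market at a plant in Mexico.

I know that if I iterate over the tokens like this: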

for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}

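I get output that lists the annotations for each token. However, I use my own rules to detect named entities, and I sometimes find that one word inside a multi-token entity is tagged PERSON when the multi-token expression as a whole should be an ORGANIZATION (this happens mostly with organization and location names).

So the output I expect is: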

MATCHED ENTITY: Donald Trump VALUE: PERSON
MATCHED ENTITY: Toyota VALUE: ORGANIZATION

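How can I change the code above to get the desired output? Do I need to use custom annotations?

I built a jar of the latest version about a week ago. Use the jar available from GitHub.

This sample code runs the rules and applies the appropriate NER tags: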

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;


public class TokensRegexExampleTwo {

  public static void main(String[] args) {

    // set up properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
    props.setProperty("tokensregex.caseInsensitive", "true");

    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // set up text to annotate
    Annotation annotation = new Annotation("...text to annotate...");

    // annotate text
    pipeline.annotate(annotation);

    // print out found entities
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}
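
The loop above prints one token per line, while the question asks for one line per entity. A minimal sketch of one way to bridge the two is to merge runs of adjacent tokens that share the same NER tag. The class name `EntityGrouper` is hypothetical, and plain string arrays stand in for the `token.word()` and `token.ner()` values so the sketch runs without CoreNLP on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges adjacent tokens with the same NER tag into one entity line. */
final class EntityGrouper {

  /** Tag CoreNLP assigns to tokens that are outside any entity. */
  private static final String OUTSIDE = "O";

  /** words[i] and tags[i] stand in for token.word() and token.ner() of the i-th token. */
  public static List<String> group(String[] words, String[] tags) {
    if (words.length != tags.length) {
      throw new IllegalArgumentException("words and tags must have the same length");
    }
    List<String> entities = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    String currentTag = OUTSIDE;
    for (int i = 0; i < words.length; i++) {
      String tag = tags[i] == null ? OUTSIDE : tags[i];
      if (!tag.equals(currentTag)) {
        // The tag changed, so the entity collected so far (if any) is complete.
        flush(entities, current, currentTag);
        currentTag = tag;
      }
      if (!OUTSIDE.equals(tag)) {
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(words[i]);
      }
    }
    flush(entities, current, currentTag);
    return entities;
  }

  /** Emits the collected text as one output line and resets the buffer. */
  private static void flush(List<String> out, StringBuilder text, String tag) {
    if (text.length() > 0) {
      out.add("MATCHED ENTITY: " + text + " VALUE: " + tag);
      text.setLength(0);
    }
  }

  public static void main(String[] args) {
    String[] words = {"Donald", "Trump", "targeted", "Toyota", "."};
    String[] tags = {"PERSON", "PERSON", "O", "ORGANIZATION", "O"};
    group(words, tags).forEach(System.out::println);
  }
}
```

This grouping has two limits worth keeping in mind: two distinct entities that sit next to each other with the same tag are merged into one, and it does not repair a multi-token entity whose tokens carry different tags. The second case still has to be handled in the rules themselves.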

I managed to get the output in the desired format:

// Needs: import edu.stanford.nlp.ling.tokensregex.*; import java.util.regex.Pattern;
Annotation document = new Annotation("<Sentence to annotate>");

// Use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Note: I don't put environment-related settings in the rules file.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
    .createExtractorFromFiles(env, "test_degree.rules");

// Iterate over the sentences of the annotated document
for (CoreMap sentence : sentences) {
  List<MatchedExpression> matched = extractor.extractExpressions(sentence);
  for (MatchedExpression phrase : matched) {
    // Print out matched text and value
    System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
  }
}

Output:

MATCHED ENTITY: Technical Skill VALUE: SKILL

You may want to take a look at my …

Hope this helps.

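Answering my own question for anyone struggling with a similar problem. The key to getting the output in the right format lies in how the rule is defined in the rules file. Here is what I changed in my rule to change the output:

Old rule: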

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     result: Annotate($1, ner, "LOCATION"),

}

New rule:

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     action: Annotate($1, ner, "LOCATION"),
     result: "LOCATION"

}

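How you define the result field determines the format of the output. With the Annotate call moved to action, the tokens are still tagged, and result supplies the plain string that phrase.getValue().get() returns for the match.

Hope this helps.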

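I get this error: "Exception in thread "main" java.lang.RuntimeException: Error parsing file: multi-step-per-org.rules" "Caused by: java.io.IOException: Unable to open "multi-step-per-org.rules" as class path, filename or URL". I cannot find this file in the build. Please help.

That is the name of my rules file. You should replace it with the name of your own rules file.

Thanks for the help! Could you take a look at my table? I have been struggling since yesterday and cannot solve this problem.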
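
The error message in the comment above says the rules file name was tried as a classpath resource, a file name, and a URL. As a rough pre-flight check, the sketch below tests the first two before the pipeline is built. `RulesFileCheck` and `locate` are hypothetical names, not part of CoreNLP, and the lookup only approximates what the library does internally:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical pre-flight check for a TokensRegex rules file name. */
final class RulesFileCheck {

  /** Returns "classpath" or "file" depending on where the name resolves, or null if neither. */
  public static String locate(String name) {
    ClassLoader loader = RulesFileCheck.class.getClassLoader();
    if (loader != null && loader.getResource(name) != null) {
      return "classpath";
    }
    if (Files.isRegularFile(Paths.get(name))) {
      return "file";
    }
    return null;
  }

  public static void main(String[] args) {
    // Pass your own rules file name as the first argument.
    String name = args.length > 0 ? args[0] : "multi-step-per-org.rules";
    String where = locate(name);
    if (where == null) {
      System.out.println("Not found on the classpath or at "
          + Paths.get(name).toAbsolutePath());
    } else {
      System.out.println("Found " + name + " via " + where);
    }
  }
}
```

If the check reports nothing, a relative name is being resolved against the current working directory, so either pass an absolute path in `tokensregex.rules` or place the file in a directory that is on the classpath.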
