StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries (TokensRegexNERAnnotator.java:696)


I want to identify the following skill phrases using StanfordNLP's TokensRegexNERAnnotator:

Area of expertise
Area of knowledge
Computer skills
Technical experience
Technical skills

and many more text sequences like the ones above.

Code -

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
    List tokens = new ArrayList<>();

    // traversing each sentence from array of sentence.
    for (String txt : tests) {
         System.out.println("String is : " + txt);

         // create an empty Annotation just with the given text
         Annotation document = new Annotation(txt);

         pipeline.annotate(document);
         List<CoreMap> sentences = document.get(SentencesAnnotation.class);

         /* Next we can go over the annotated sentences and extract the annotated words,
         Using the CoreLabel Object */
      for (CoreMap sentence : sentences) {
         for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
             System.out.println("annotated coreMap sentences : " + token);
             // Extracting NER tag for current token
             String ne = token.get(NamedEntityTagAnnotation.class);
             String word = token.get(CoreAnnotations.TextAnnotation.class);
             System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
             System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
             System.out.println("Named Entity : " + ne);
         }
      }
    }
My regex rule file is -

    $SKILL_FIRST_KEYWORD = "/area of/|/technical/|/computer/|/professional/"
    $SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/experience/"

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    {
      ruleType: "tokens",
      pattern: ( $SKILL_FIRST_KEYWORD + $SKILL_KEYWORD ),
      result: "SKILL"
    }

I am getting an
ArrayIndexOutOfBoundsException
error. I think something is wrong with my rule file. Can someone tell me where I am going wrong?

Expected output -

Area of expertise - SKILL

Area of knowledge - SKILL

Computer skills - SKILL

and so on.


Thanks in advance.

You should use TokensRegexAnnotator, not TokensRegexNERAnnotator. TokensRegexNERAnnotator expects a tab-delimited mapping file (pattern, then tag), so handing it a TokensRegex rules file fails inside readEntries.
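In case it helps, here is a minimal sketch of wiring TokensRegexAnnotator in through pipeline properties instead of calling addAnnotator by hand. This assumes your CoreNLP version registers the annotator under the standard name "tokensregex" with a "tokensregex.rules" property; the rule-file path is the one from the question.

```java
import java.util.Properties;

public class TokensRegexConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Append the tokensregex annotator after the usual preprocessing steps.
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, tokensregex");
        // Point the annotator at the TokensRegex extraction rules file.
        props.put("tokensregex.rules", "./mapping/test_degree.rules");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println(props.getProperty("annotators"));
        // new StanfordCoreNLP(props) would then run TokensRegexAnnotator
        // as part of the pipeline.
    }
}
```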

You should look at these threads for more info:


The accepted answer from @StanfordNLPHelp above helped me solve this problem. All credit goes to them.

I just want to summarize how the final code produced output in the desired format, in the hope that it helps someone.

First, I changed my rule file:

    $SKILL_FIRST_KEYWORD = "/area of|technical|computer|professional/"
    $SKILL_KEYWORD = "/knowledge|skill|skills|experience/"
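Everything between the slashes is one regular expression matched against a token, so the alternations in the rule file can be exercised with plain java.util.regex. This is an illustrative sketch only, not TokensRegex itself:

```java
import java.util.regex.Pattern;

public class KeywordAlternationDemo {
    // The same alternations as the rule file, as plain Java regexes.
    public static final Pattern FIRST =
        Pattern.compile("area of|technical|computer|professional");
    public static final Pattern SECOND =
        Pattern.compile("knowledge|skill|skills|experience");

    public static void main(String[] args) {
        // matches() requires the whole string to equal one alternative.
        System.out.println(FIRST.matcher("technical").matches());  // true
        System.out.println(SECOND.matcher("skills").matches());    // true
        System.out.println(SECOND.matcher("skillful").matches());  // false: no branch covers it
    }
}
```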

Then in the code:

props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

for (String txt : tests) {
     System.out.println("String is : " + txt);

     // create an empty Annotation just with the given text
     Annotation document = new Annotation(txt);

     pipeline.annotate(document);
     List<CoreMap> sentences = document.get(SentencesAnnotation.class);

     Env env = TokenSequencePattern.getNewEnv();
     env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
     env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

     CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
     for (CoreMap sentence : sentences) {
         List<MatchedExpression> matched = extractor.extractExpressions(sentence);
         for(MatchedExpression phrase : matched){
             // Print out matched text and value
             System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
         }
    }
}
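The two env.setDefault… calls are what let the lower-cased rule file match "Technical Skill" in the input. Conceptually they behave like java.util.regex's CASE_INSENSITIVE flag; a standalone sketch of that behavior (illustrative, not TokensRegex itself):

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {
    // Same alternation as $SKILL_FIRST_KEYWORD, compiled case-insensitively,
    // roughly as the Env flags arrange inside TokensRegex.
    public static boolean matchesFirstKeyword(String token) {
        Pattern p = Pattern.compile("area of|technical|computer|professional",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesFirstKeyword("Technical")); // true despite the capital T
        System.out.println(matchesFirstKeyword("skill"));     // false: not in this keyword list
    }
}
```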

Why do you use
/area of/|…
rather than just
area of|…
? Does the code need that to work? After changing it, I still got the same error.