StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries (TokensRegexNERAnnotator.java:696)


I want to identify the following skill phrases using StanfordNLP's TokensRegexNERAnnotator:

Area of expertise
Area of knowledge
Computer skills
Technical experience
Technical skills

and many more text sequences like the ones above.

Code -

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
    List tokens = new ArrayList<>();

    // traversing each sentence from array of sentence.
    for (String txt : tests) {
         System.out.println("String is : " + txt);

         // create an empty Annotation just with the given text
         Annotation document = new Annotation(txt);

         pipeline.annotate(document);
         List<CoreMap> sentences = document.get(SentencesAnnotation.class);

         /* Next we can go over the annotated sentences and extract the annotated words,
         Using the CoreLabel Object */
      for (CoreMap sentence : sentences) {
         for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
             System.out.println("annotated coreMap sentences : " + token);
             // Extracting NER tag for current token
             String ne = token.get(NamedEntityTagAnnotation.class);
             String word = token.get(CoreAnnotations.TextAnnotation.class);
             System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
             System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
             System.out.println("Named Entity : " + ne);
         }
      }
    }
My regex rule file is -

    $SKILL_FIRST_KEYWORD = "/area of/|/technical/|/computer/|/professional/"
    $SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/experience/"

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    {
      ruleType: "tokens",
      pattern: ( $SKILL_FIRST_KEYWORD + $SKILL_KEYWORD ),
      result: "SKILL"
    }

I am getting an
ArrayIndexOutOfBoundsException
error. I think something is wrong with my rule file. Can someone tell me where I am going wrong?

Expected output -

Area of expertise - SKILL

Area of knowledge - SKILL

Computer skills - SKILL

and so on.


Thanks in advance.

You should use TokensRegexAnnotator, not TokensRegexNERAnnotator. TokensRegexNERAnnotator expects a tab-delimited mapping file (pattern, then tag), so handing it a TokensRegex rules file fails inside readEntries.
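In case it helps, here is a minimal sketch of wiring TokensRegexAnnotator in through pipeline properties instead of calling addAnnotator by hand. This assumes your CoreNLP version registers the annotator under the standard name "tokensregex" with a "tokensregex.rules" property; the rule-file path is the one from the question.

```java
import java.util.Properties;

public class TokensRegexConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Append the tokensregex annotator after the usual preprocessing steps.
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, tokensregex");
        // Point the annotator at the TokensRegex extraction rules file.
        props.put("tokensregex.rules", "./mapping/test_degree.rules");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println(props.getProperty("annotators"));
        // new StanfordCoreNLP(props) would then run TokensRegexAnnotator
        // as part of the pipeline.
    }
}
```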

You should look at these threads for more info:


The accepted answer from @StanfordNLPHelp above helped me solve this problem. All credit goes to them.

I just want to summarize how the final code produced output in the desired format, in the hope that it helps someone.

First, I changed my rule file:

    $SKILL_FIRST_KEYWORD = "/area of|technical|computer|professional/"
    $SKILL_KEYWORD = "/knowledge|skill|skills|experience/"
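Everything between the slashes is one regular expression matched against a token, so the alternations in the rule file can be exercised with plain java.util.regex. This is an illustrative sketch only, not TokensRegex itself:

```java
import java.util.regex.Pattern;

public class KeywordAlternationDemo {
    // The same alternations as the rule file, as plain Java regexes.
    public static final Pattern FIRST =
        Pattern.compile("area of|technical|computer|professional");
    public static final Pattern SECOND =
        Pattern.compile("knowledge|skill|skills|experience");

    public static void main(String[] args) {
        // matches() requires the whole string to equal one alternative.
        System.out.println(FIRST.matcher("technical").matches());  // true
        System.out.println(SECOND.matcher("skills").matches());    // true
        System.out.println(SECOND.matcher("skillful").matches());  // false: no branch covers it
    }
}
```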

Then in the code:

props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

for (String txt : tests) {
     System.out.println("String is : " + txt);

     // create an empty Annotation just with the given text
     Annotation document = new Annotation(txt);

     pipeline.annotate(document);
     List<CoreMap> sentences = document.get(SentencesAnnotation.class);

     Env env = TokenSequencePattern.getNewEnv();
     env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
     env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

     CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
     for (CoreMap sentence : sentences) {
         List<MatchedExpression> matched = extractor.extractExpressions(sentence);
         for(MatchedExpression phrase : matched){
             // Print out matched text and value
             System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
         }
    }
}
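The two env.setDefault… calls are what let the lower-cased rule file match "Technical Skill" in the input. Conceptually they behave like java.util.regex's CASE_INSENSITIVE flag; a standalone sketch of that behavior (illustrative, not TokensRegex itself):

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {
    // Same alternation as $SKILL_FIRST_KEYWORD, compiled case-insensitively,
    // roughly as the Env flags arrange inside TokensRegex.
    public static boolean matchesFirstKeyword(String token) {
        Pattern p = Pattern.compile("area of|technical|computer|professional",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesFirstKeyword("Technical")); // true despite the capital T
        System.out.println(matchesFirstKeyword("skill"));     // false: not in this keyword list
    }
}
```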

Why do you use
/area of/|…
rather than just
area of|…
? Does the code need that to work? After changing it, I still got the same error.