ArrayIndexOutOfBoundsException at StanfordNLP TokensRegexNERAnnotator.readEntries (TokensRegexNERAnnotator.java:696)
java, nlp, stanford-nlp

I want to recognize the following skill phrases using Stanford NLP's TokensRegexNERAnnotator:
Area of expertise
Knowledge area
Computer skills
Technical experience
Technical skills
and there are more text sequences like the ones above.
Code -
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.TokensRegexNERAnnotator;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
List<CoreLabel> tokens = new ArrayList<>();
// traversing each sentence from the array of sentences
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    /* Next we can go over the annotated sentences and extract the annotated words,
       using the CoreLabel object */
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            System.out.println("annotated coreMap sentences : " + token);
            // extract the NER tag for the current token
            String ne = token.get(NamedEntityTagAnnotation.class);
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
            System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
            System.out.println("Named Entity : " + ne);
        }
    }
}
My regex rules file is -
$SKILL_FIRST_KEYWORD = "/area of/|/technical/|/computer/|/professional/"
$SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/experience/"

tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
{
    ruleType: "tokens",
    pattern: ( $SKILL_FIRST_KEYWORD $SKILL_KEYWORD ),
    result: "SKILL"
}
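For context on where the exception comes from: TokensRegexNERAnnotator does not parse this rules syntax at all. Its mapping files are tab-delimited, one entry per line, so readEntries presumably runs off the end of the column array when it is fed the rules-format lines above. A minimal mapping file in that tab-separated format might look like the sketch below (the phrases and the SKILL tag come from this question; the overwrite column and the priority are illustrative assumptions, and the columns must be separated by real tab characters):

```text
Area of Expertise	SKILL	MISC	1.0
Technical Skills	SKILL	MISC	1.0
Computer Skills	SKILL	MISC	1.0
```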
I am getting an ArrayIndexOutOfBoundsException, so I think something is wrong with my rules file. Can someone tell me where I am going wrong?
Desired output -
Area of expertise - SKILL
Knowledge area - SKILL
Computer skills - SKILL
and so on.
Thanks in advance.

You should use TokensRegexAnnotator, not TokensRegexNERAnnotator. You should look at these threads for more information:
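A sketch of how that suggestion might be wired into the pipeline, assuming CoreNLP's standard customAnnotatorClass mechanism (the annotator name tokensregex and the rules path here are my assumptions, not part of the answer):

```text
annotators = tokenize, ssplit, pos, lemma, ner, tokensregex
customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator
tokensregex.rules = ./mapping/test_degree.rules
```

With properties along these lines, the rules file is applied during the normal annotate() call instead of through TokensRegexNERAnnotator.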
The accepted answer by @StanfordNLPHelp above helped me solve the problem; all credit goes to them. I just want to summarize how the final code produced output in the desired format, in the hope that it helps someone. First, I changed the rules file:
$SKILL_FIRST_KEYWORD = "/area of|technical|computer|professional/"
$SKILL_KEYWORD = "/knowledge|skill|skills|experience/"
Then, in the code:
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.NodePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    Env env = TokenSequencePattern.getNewEnv();
    env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
    env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
    CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
    for (CoreMap sentence : sentences) {
        List<MatchedExpression> matched = extractor.extractExpressions(sentence);
        for (MatchedExpression phrase : matched) {
            // print out matched text and value
            System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
        }
    }
}
Why use /area of/|... rather than just area of|...? Does the code need that to work? After changing it I got the same error.
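As one illustration of what the slashes mean (my reading of the TokensRegex syntax, sketched here with plain java.util.regex instead of CoreNLP): each /…/ in a pattern is a regex matched against a single token, so after tokenization the two-token phrase "area of" cannot be matched as one token, while one-word alternatives such as technical do match:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class TokenMatchDemo {
    // a single-token pattern in the spirit of $SKILL_FIRST_KEYWORD
    static final Pattern FIRST =
            Pattern.compile("area of|technical|computer|professional",
                            Pattern.CASE_INSENSITIVE);

    // does any single whitespace-delimited token match the pattern in full?
    static boolean anyTokenMatches(String sentence, Pattern p) {
        return Arrays.stream(sentence.split("\\s+"))
                     .anyMatch(tok -> p.matcher(tok).matches());
    }

    public static void main(String[] args) {
        // "Technical" is a single token, so it matches one alternative
        System.out.println(anyTokenMatches("Technical skill required", FIRST));  // true
        // "area" and "of" are separate tokens; neither equals "area of"
        System.out.println(anyTokenMatches("area of expertise", FIRST));  // false
    }
}
```

This is only an analogy for the token-level matching; the hypothetical anyTokenMatches helper stands in for what TokensRegex does per token and is not a CoreNLP API.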