Java CoreNLP: extracting token spans
I want to extract the spans of the tokens in a tokenized string. Using Stanford CoreNLP, I have:
Properties props;
props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
this.pipeline = new StanfordCoreNLP(props);

String answerText = "This is the answer";
ArrayList<IntPair> tokenSpans = new ArrayList<IntPair>();

// create an empty Annotation with just the given text
Annotation document = new Annotation(answerText);

// run all Annotators on this text
this.pipeline.annotate(document);

// Iterate over all of the sentences
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // Iterate over all tokens in a sentence
    for (CoreLabel fullToken : sentence.get(TokensAnnotation.class)) {
        IntPair span = fullToken.get(SpanAnnotation.class);
        tokenSpans.add(span);
    }
}
Expected output:
(0,3), (5,6), (8,10), (12,17)
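The problem is the use of SpanAnnotation, which applies to trees. The correct classes for this query are CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation. For example, they can be used like this: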
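// tokenSeq is assumed to be the token list, e.g. sentence.get(TokensAnnotation.class)
List<IntPair> spans = tokenSeq.stream()
    .map(token ->
        new IntPair(
            token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class),
            token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class)))
    .collect(Collectors.toList());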
...please excuse my indentation.
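One thing to watch when comparing against the expected output in the question: as far as I know, CoreNLP's end offset points at the first character after the token (end-exclusive), while the expected output uses the index of the token's last character (end-inclusive). The sketch below does not use CoreNLP at all. It is a plain-Java whitespace splitter, with the hypothetical names TokenOffsets, whitespaceSpans and format, that stands in for the tokenizer only to show the two conventions side by side. The real tokenizer also splits punctuation and other cases this sketch ignores.

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer: splits on whitespace only.
class TokenOffsets {

    // Returns {begin, end} pairs: begin is inclusive, end is exclusive,
    // so text.substring(begin, end) yields the token.
    static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int begin = i;
            // Advance to the first whitespace character after the token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] {begin, i});
        }
        return spans;
    }

    // Formats spans as "(b,e), (b,e)"; if inclusiveEnd is true,
    // the end is shifted back by one to point at the last character.
    static String format(List<int[]> spans, boolean inclusiveEnd) {
        StringBuilder sb = new StringBuilder();
        for (int[] span : spans) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            int end = inclusiveEnd ? span[1] - 1 : span[1];
            sb.append('(').append(span[0]).append(',').append(end).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> spans = whitespaceSpans("This is the answer");
        System.out.println(format(spans, false)); // (0,4), (5,7), (8,11), (12,18)
        System.out.println(format(spans, true));  // (0,3), (5,6), (8,10), (12,17)
    }
}

So if the begin/end offset annotations give pairs like the first line, subtracting one from each end offset should give the form asked for in the question.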