Java CoreNLP: extracting the spans of tokens

java nlp stanford-nlp

I want to extract the character spans of the tokens in a tokenized string. Using Stanford's CoreNLP, I have:
Properties props;
props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
this.pipeline = new StanfordCoreNLP(props);

String answerText = "This is the answer";
ArrayList<IntPair> tokenSpans = new ArrayList<IntPair>();
// create an empty Annotation with just the given text
Annotation document = new Annotation(answerText);
// run all Annotators on this text
this.pipeline.annotate(document);

// Iterate over all of the sentences
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
    // Iterate over all tokens in a sentence
    for (CoreLabel fullToken: sentence.get(TokensAnnotation.class)) {
        IntPair span = fullToken.get(SpanAnnotation.class);
        tokenSpans.add(span);
    }
}

Expected output:
(0,3), (5,6), (8,10), (12,17)

The problem is the use of SpanAnnotation, which applies to trees. The correct classes for this query are CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.

For example, they can be used like this:
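// tokenSeq: a List<CoreLabel>, e.g. sentence.get(TokensAnnotation.class)
List<IntPair> spans = tokenSeq.stream()
    .map(token -> new IntPair(
        token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class),
        token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class)))
    .collect(Collectors.toList());  // requires java.util.stream.Collectors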

…please excuse my indentation.
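One detail worth checking against the expected output in the question: to my understanding, CoreNLP's end offset points one past the last character of the token, so the begin/end pairs for "This is the answer" would come out as (0,4), (5,7), (8,11), (12,18) rather than (0,3), (5,6), (8,10), (12,17). The sketch below illustrates that difference and the subtract-one conversion. It does not call CoreNLP: a plain whitespace tokenizer stands in for the pipeline, and `Span`, `halfOpenSpans`, and `toInclusive` are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class InclusiveSpans {
    /** Minimal offset pair; a hypothetical stand-in for CoreNLP's IntPair. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + begin + "," + end + ")";
        }
    }

    /** Whitespace tokenizer that records half-open [begin, end) character offsets. */
    public static List<Span> halfOpenSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            // Advance to the first character after the token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new Span(begin, i));
            }
        }
        return spans;
    }

    /** Converts half-open spans to spans whose end is the index of the last character. */
    public static List<Span> toInclusive(List<Span> halfOpen) {
        List<Span> out = new ArrayList<>();
        for (Span s : halfOpen) {
            out.add(new Span(s.begin, s.end - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Span> spans = halfOpenSpans("This is the answer");
        System.out.println(spans);              // [(0,4), (5,7), (8,11), (12,18)]
        System.out.println(toInclusive(spans)); // [(0,3), (5,6), (8,10), (12,17)]
    }
}
```

With real CoreNLP tokens, the same conversion would amount to subtracting 1 from the CharacterOffsetEndAnnotation value before building each IntPair, assuming the end offset is exclusive as described above.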