Java Tree node to GrammaticalStructure dependency mapping


I am using the Stanford CoreNLP framework 3.4.1 to build syntactic parse trees of Wikipedia sentences. Afterwards I want to extract from each parse tree all tree fragments of a given length (i.e. up to 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for every subtree.

This is the code I use to construct the parse tree; most of it comes from TreePrint.printTreeInternal() for the CoNLL 2007 format, which I modified to fit my output needs:

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));

    // Load the parser model and language pack once, not once per sentence.
    LexicalizedParser lp = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

    for (List<HasWord> sentence : dp) {
        StringBuilder plaintexSyntacticTree = new StringBuilder();
        String sentenceString = Sentence.listToString(sentence);

        PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
        List<Word> toks = tkzr.tokenize();
        // skip sentences shorter than 5 words
        if (toks.size() < 5)
            continue;
        log.info("\nTokens are: " + PTBTokenizer.labelList2Text(toks));

        Tree parse = lp.apply(toks);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> tdl = gs.allTypedDependencies();
        Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
        it.indexLeaves();

        List<CoreLabel> tagged = it.taggedLabeledYield();
        // sort the typed dependencies by the index of the dependent token
        List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
        for (TypedDependency dep : tdl) {
            NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
            sortedDeps.add(nd);
        }
        Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

        for (int i = 0; i < sortedDeps.size(); i++) {
            Dependency<Label, Label, Object> d = sortedDeps.get(i);

            CoreMap dep = (CoreMap) d.dependent();
            CoreMap gov = (CoreMap) d.governor();

            Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
            Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);

            CoreLabel w = tagged.get(depi - 1);

            // Used for both coarse and fine POS tag fields
            String tag = PTBTokenizer.ptbToken2Text(w.tag());
            String word = PTBTokenizer.ptbToken2Text(w.word());

            if (plaintexSyntacticTree.length() > 0)
                plaintexSyntacticTree.append(' ');
            plaintexSyntacticTree.append(word + '/' + tag + '/' + govi);
        }
        log.info("\nTree is: " + plaintexSyntacticTree);
    }
In the output I need to get something in the following format: word/PartOfSpeech-tag/parentID, which is compatible with the format used by the Google syntactic n-grams corpus.
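For illustration (an example sentence of my own, assuming 1-based token indices and parentID 0 for the root), the sentence "Dogs chase cats" should come out as:

    Dogs/NNS/2 chase/VBP/0 cats/NNS/2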

I cannot see how I could get the POS tag and parentID from the original parse tree (which, as far as I can tell, is stored as a list of dependencies in the GrammaticalStructure) for only a subset of the nodes of the original tree.

I have also seen some mention of the GrammaticalStructure, but as far as I can tell it is only useful for building a grammatical structure, while I am trying to work with an existing one. I have also seen a similar question, but it is still open and it does not address subtrees or creating custom output. Instead of creating a tree from the GrammaticalStructure, I figured I could use the node references from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() that can fetch a node by index from the GrammaticalStructure.
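(For what it's worth, the SemanticGraph class used in the update below does expose such a lookup, getNodeByIndex(int); a minimal sketch, assuming sg is the SemanticGraph of the sentence:)

    IndexedWord node = sg.getNodeByIndex(3); // 1-based token index
    log.info(node.word() + "/" + node.tag());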

Update: I have obtained all the needed information using the SemanticGraph, as suggested in the answer below. Here is a basic snippet of code that does it:

    String documentText = value.toString();
    Properties props = new Properties();
    props.put("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(documentText);
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

    if (sentences != null && !sentences.isEmpty()) {
        CoreMap sentence = sentences.get(0);
        SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
        log.info("SemanticGraph: " + sg.toDotFormat());
        for (SemanticGraphEdge edge : sg.edgeIterable()) {
            int headIndex = edge.getGovernor().index();
            int depIndex = edge.getDependent().index();
            // getSource() is the governor of the edge
            log.info("[" + headIndex + "]" + edge.getSource().word() + "/" + depIndex + "/"
                    + edge.getSource().get(CoreAnnotations.PartOfSpeechAnnotation.class));
        }
    }
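To produce the word/POS-tag/parentID triples directly, the graph can also be walked token by token. The sketch below is my own addition, assuming the SemanticGraph sg from the snippet above, the usual java.util imports, and that tokens without an incoming edge are roots with parentID 0:

    // Map each dependent token index to its governor's index.
    Map<Integer, Integer> parentOf = new HashMap<Integer, Integer>();
    for (SemanticGraphEdge edge : sg.edgeIterable()) {
        parentOf.put(edge.getDependent().index(), edge.getGovernor().index());
    }
    StringBuilder line = new StringBuilder();
    for (IndexedWord node : sg.vertexListSorted()) {
        Integer parent = parentOf.get(node.index());
        if (line.length() > 0)
            line.append(' ');
        // node.tag() returns the POS tag attached by the pos annotator
        line.append(node.word()).append('/').append(node.tag()).append('/')
            .append(parent == null ? 0 : parent);
    }
    log.info("\nTree is: " + line);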

Google syntactic n-grams use dependency trees rather than constituency trees. So, indeed, the only way to get that representation is to convert your trees to dependency trees. The parent ID you would get from a constituency parse would be for an intermediate node, not for another word of the sentence.

My recommendation is to run the dependency parser annotator (annotators = tokenize, ssplit, pos, depparse) and to extract all clusters of 5 neighboring nodes from the resulting SemanticGraph, as sketched below.
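As an illustration of that last step, here is a minimal, self-contained sketch of my own (the class and method names are hypothetical, not part of the original answer): it treats the SemanticGraph edges as an undirected adjacency map and enumerates every connected set of up to 5 nodes by growing each set one neighbor at a time.

    import edu.stanford.nlp.ling.IndexedWord;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphEdge;

    import java.util.*;

    public class FragmentExtractor {

        // Collect all connected node sets of size <= maxSize from the graph.
        public static Set<Set<IndexedWord>> fragments(SemanticGraph sg, int maxSize) {
            // Build an undirected adjacency map from the typed dependency edges.
            Map<IndexedWord, Set<IndexedWord>> adj = new HashMap<IndexedWord, Set<IndexedWord>>();
            for (SemanticGraphEdge edge : sg.edgeIterable()) {
                adj.computeIfAbsent(edge.getGovernor(), k -> new HashSet<>()).add(edge.getDependent());
                adj.computeIfAbsent(edge.getDependent(), k -> new HashSet<>()).add(edge.getGovernor());
            }
            Set<Set<IndexedWord>> result = new HashSet<Set<IndexedWord>>();
            for (IndexedWord seed : adj.keySet()) {
                grow(Collections.singleton(seed), adj, maxSize, result);
            }
            return result;
        }

        // Recursively extend a connected set by one neighboring node at a time.
        // Every connected subgraph can be built this way from any of its nodes,
        // so each fragment is found at least once; the result set deduplicates.
        private static void grow(Set<IndexedWord> current,
                                 Map<IndexedWord, Set<IndexedWord>> adj,
                                 int maxSize, Set<Set<IndexedWord>> result) {
            if (!result.add(current) || current.size() == maxSize) {
                return; // already seen this set, or it has reached the size limit
            }
            for (IndexedWord member : current) {
                for (IndexedWord neighbor : adj.get(member)) {
                    if (!current.contains(neighbor)) {
                        Set<IndexedWord> extended = new HashSet<IndexedWord>(current);
                        extended.add(neighbor);
                        grow(extended, adj, maxSize, result);
                    }
                }
            }
        }
    }

With maxSize fixed at 5 the exponential blow-up stays bounded in practice, and each resulting fragment can then be printed with the same word/POS-tag/parentID scheme as above.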