Java Tree node to GrammaticalStructure dependency mapping


I am using the Stanford CoreNLP framework 3.4.1 to build syntactic parse trees of Wikipedia sentences. Afterwards I want to extract from each parse tree all tree fragments of a given length (i.e. up to 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for every subtree.

This is the code I use to construct the parse tree; most of it comes from TreePrint.printTreeInternal() for the CoNLL 2007 format, which I modified to fit my output needs:

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));

    // Load the parser model and language pack once, not once per sentence.
    LexicalizedParser lp = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

    for (List<HasWord> sentence : dp) {
        StringBuilder plaintexSyntacticTree = new StringBuilder();
        String sentenceString = Sentence.listToString(sentence);

        PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
        List<Word> toks = tkzr.tokenize();
        // skip sentences shorter than 5 words
        if (toks.size() < 5)
            continue;
        log.info("\nTokens are: " + PTBTokenizer.labelList2Text(toks));

        Tree parse = lp.apply(toks);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> tdl = gs.allTypedDependencies();
        Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
        it.indexLeaves();

        List<CoreLabel> tagged = it.taggedLabeledYield();
        // sort the typed dependencies by the index of the dependent token
        List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
        for (TypedDependency dep : tdl) {
            NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
            sortedDeps.add(nd);
        }
        Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

        for (int i = 0; i < sortedDeps.size(); i++) {
            Dependency<Label, Label, Object> d = sortedDeps.get(i);

            CoreMap dep = (CoreMap) d.dependent();
            CoreMap gov = (CoreMap) d.governor();

            Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
            Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);

            CoreLabel w = tagged.get(depi - 1);

            // Used for both coarse and fine POS tag fields
            String tag = PTBTokenizer.ptbToken2Text(w.tag());
            String word = PTBTokenizer.ptbToken2Text(w.word());

            if (plaintexSyntacticTree.length() > 0)
                plaintexSyntacticTree.append(' ');
            plaintexSyntacticTree.append(word + '/' + tag + '/' + govi);
        }
        log.info("\nTree is: " + plaintexSyntacticTree);
    }
In the output I need to get something in the following format: word/PartOfSpeech-tag/parentID, which is compatible with the format used by the Google syntactic n-grams corpus.
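For illustration (an example sentence of my own, assuming 1-based token indices and parentID 0 for the root), the sentence "Dogs chase cats" should come out as:

    Dogs/NNS/2 chase/VBP/0 cats/NNS/2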

I cannot see how I could get the POS tag and parentID from the original parse tree (which, as far as I can tell, is stored as a list of dependencies in the GrammaticalStructure) for only a subset of the nodes of the original tree.

I have also seen some mention of the GrammaticalStructure, but as far as I can tell it is only useful for building a grammatical structure, while I am trying to work with an existing one. I have also seen a similar question, but it is still open and it does not address subtrees or creating custom output. Instead of creating a tree from the GrammaticalStructure, I figured I could use the node references from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() that can fetch a node by index from the GrammaticalStructure.
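(For what it's worth, the SemanticGraph class used in the update below does expose such a lookup, getNodeByIndex(int); a minimal sketch, assuming sg is the SemanticGraph of the sentence:)

    IndexedWord node = sg.getNodeByIndex(3); // 1-based token index
    log.info(node.word() + "/" + node.tag());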

Update: I have obtained all the needed information using the SemanticGraph, as suggested in the answer below. Here is a basic snippet of code that does it:

    String documentText = value.toString();
    Properties props = new Properties();
    props.put("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(documentText);
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

    if (sentences != null && !sentences.isEmpty()) {
        CoreMap sentence = sentences.get(0);
        SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
        log.info("SemanticGraph: " + sg.toDotFormat());
        for (SemanticGraphEdge edge : sg.edgeIterable()) {
            int headIndex = edge.getGovernor().index();
            int depIndex = edge.getDependent().index();
            // getSource() is the governor of the edge
            log.info("[" + headIndex + "]" + edge.getSource().word() + "/" + depIndex + "/"
                    + edge.getSource().get(CoreAnnotations.PartOfSpeechAnnotation.class));
        }
    }
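To produce the word/POS-tag/parentID triples directly, the graph can also be walked token by token. The sketch below is my own addition, assuming the SemanticGraph sg from the snippet above, the usual java.util imports, and that tokens without an incoming edge are roots with parentID 0:

    // Map each dependent token index to its governor's index.
    Map<Integer, Integer> parentOf = new HashMap<Integer, Integer>();
    for (SemanticGraphEdge edge : sg.edgeIterable()) {
        parentOf.put(edge.getDependent().index(), edge.getGovernor().index());
    }
    StringBuilder line = new StringBuilder();
    for (IndexedWord node : sg.vertexListSorted()) {
        Integer parent = parentOf.get(node.index());
        if (line.length() > 0)
            line.append(' ');
        // node.tag() returns the POS tag attached by the pos annotator
        line.append(node.word()).append('/').append(node.tag()).append('/')
            .append(parent == null ? 0 : parent);
    }
    log.info("\nTree is: " + line);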

Google syntactic n-grams use dependency trees rather than constituency trees. So, indeed, the only way to get that representation is to convert your trees to dependency trees. The parent ID you would get from a constituency parse would be for an intermediate node, not for another word of the sentence.

My recommendation is to run the dependency parser annotator (annotators = tokenize, ssplit, pos, depparse) and to extract all clusters of 5 neighboring nodes from the resulting SemanticGraph, as sketched below.
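As an illustration of that last step, here is a minimal, self-contained sketch of my own (the class and method names are hypothetical, not part of the original answer): it treats the SemanticGraph edges as an undirected adjacency map and enumerates every connected set of up to 5 nodes by growing each set one neighbor at a time.

    import edu.stanford.nlp.ling.IndexedWord;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphEdge;

    import java.util.*;

    public class FragmentExtractor {

        // Collect all connected node sets of size <= maxSize from the graph.
        public static Set<Set<IndexedWord>> fragments(SemanticGraph sg, int maxSize) {
            // Build an undirected adjacency map from the typed dependency edges.
            Map<IndexedWord, Set<IndexedWord>> adj = new HashMap<IndexedWord, Set<IndexedWord>>();
            for (SemanticGraphEdge edge : sg.edgeIterable()) {
                adj.computeIfAbsent(edge.getGovernor(), k -> new HashSet<>()).add(edge.getDependent());
                adj.computeIfAbsent(edge.getDependent(), k -> new HashSet<>()).add(edge.getGovernor());
            }
            Set<Set<IndexedWord>> result = new HashSet<Set<IndexedWord>>();
            for (IndexedWord seed : adj.keySet()) {
                grow(Collections.singleton(seed), adj, maxSize, result);
            }
            return result;
        }

        // Recursively extend a connected set by one neighboring node at a time.
        // Every connected subgraph can be built this way from any of its nodes,
        // so each fragment is found at least once; the result set deduplicates.
        private static void grow(Set<IndexedWord> current,
                                 Map<IndexedWord, Set<IndexedWord>> adj,
                                 int maxSize, Set<Set<IndexedWord>> result) {
            if (!result.add(current) || current.size() == maxSize) {
                return; // already seen this set, or it has reached the size limit
            }
            for (IndexedWord member : current) {
                for (IndexedWord neighbor : adj.get(member)) {
                    if (!current.contains(neighbor)) {
                        Set<IndexedWord> extended = new HashSet<IndexedWord>(current);
                        extended.add(neighbor);
                        grow(extended, adj, maxSize, result);
                    }
                }
            }
        }
    }

With maxSize fixed at 5 the exponential blow-up stays bounded in practice, and each resulting fragment can then be printed with the same word/POS-tag/parentID scheme as above.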