Java: Extracting VPs/NPs joined by conjunctions using Tregex for the Stanford Parser

java parsing stanford-nlp


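I want to split a tree on conjunctions and commas. For example, when I have "VP and VP", "NP and NP", "VP, VP", or "NP, NP", I want to extract each VP or NP separately. I have the following code: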

List<Tree> subtrees = constituent.subTreeList();

for (int i = 0; i < subtrees.size(); i++) {
    String s = "@VP $+ CC $+ @VP";
    TregexPattern p = TregexPattern.compile(s);
    TregexMatcher m = p.matcher(subtrees.get(i));
    while (m.find()) {
        m.getMatch().pennPrint();
        Tree foundTree = m.getMatch();
        System.out.println(m.getMatch());
    }
}

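The main problem here is that chained Tregex relations (following the tradition of tgrep and tgrep2) have special non-associative semantics: A r1 B r2 C [r3 D] means A r1 B and A r2 C and A r3 D. (This generally makes sense for the core use case of something like A < B < C, which means that an A node has both a B child and a C child.) To get a different grouping, you need to use parentheses. In particular, the pattern you need here is "@VP $+ (CC $+ @VP)".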
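To make the difference between the two readings concrete, here is a small toy model in plain Java. It does not use Tregex: the class and method names (ChainedRelationDemo, matchesChained, matchesGrouped) are made up for illustration, and a list of sibling labels stands in for the children of one tree node.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (not the Tregex implementation): labels of sibling nodes, left to right.
class ChainedRelationDemo {

    // True if the node at index i is the immediate left sister of a node with this label,
    // which is what "X $+ label" asks for.
    static boolean immediateLeftSisterOf(List<String> siblings, int i, String label) {
        return i + 1 < siblings.size() && siblings.get(i + 1).equals(label);
    }

    // Chained reading of "VP $+ CC $+ VP": both relations attach to the first VP.
    static boolean matchesChained(List<String> siblings, int i) {
        return siblings.get(i).equals("VP")
                && immediateLeftSisterOf(siblings, i, "CC")
                && immediateLeftSisterOf(siblings, i, "VP");
    }

    // Grouped reading of "VP $+ (CC $+ VP)": the second relation attaches to the CC.
    static boolean matchesGrouped(List<String> siblings, int i) {
        return siblings.get(i).equals("VP")
                && immediateLeftSisterOf(siblings, i, "CC")
                && immediateLeftSisterOf(siblings, i + 1, "VP");
    }

    public static void main(String[] args) {
        List<String> siblings = Arrays.asList("VP", "CC", "VP");
        System.out.println("chained: " + matchesChained(siblings, 0)); // chained: false
        System.out.println("grouped: " + matchesGrouped(siblings, 0)); // grouped: true
    }
}
```

Under the chained reading the first VP would need both a CC and a VP as its immediate right sister, which no node can satisfy, so the unparenthesized pattern never matches.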

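This is documented under the list of relations, but I realize that it is an easy mistake to make, especially because the semantics is very non-standard relative to typical math or programming language expressions.

Then there are some other improvements to make, as @dantiston pointed out. As with regular expressions, you should compile the pattern only once, outside the loop. Also, you are better off letting Tregex iterate over the nodes of the tree rather than building a complete list of all subtrees. Here is some good sample code: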
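import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.tregex.TregexMatcher;
import edu.stanford.nlp.trees.tregex.TregexPattern;
import java.util.Collections;
import java.util.List;

Tree t2 = Tree.valueOf("(VP (VP (VB manage) (NP (NP (DT the) (JJ entire) (NN life) (NN cycle)) (PP (IN of) (NP (PRP$ your) (NNS APIs))))) (CC and) (VP (VB expose) (NP (PRP$ your) (NNS APIs)) (PP (TO to) (NP (JJ third-party) (NNS developers)))))");
List<Tree> trees = Collections.singletonList(t2);

// Parentheses make the second $+ apply to the conjunction rather than to the first VP.
String s = "@VP $+ (@CONJP|CC $+ @VP)";
// Compile the pattern once, outside the loop.
TregexPattern p = TregexPattern.compile(s);
for (Tree t : trees) {
  // The matcher visits every node of t, so no subtree list is needed.
  TregexMatcher m = p.matcher(t);
  while (m.findNextMatchingNode()) {
    Tree foundTree = m.getMatch();
    System.out.println(foundTree);
  }
}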
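The sample code prints the node matched by the head of the pattern, that is, the VP to the left of the conjunction. If the goal is to collect every conjunct, including those separated by commas, the idea can be sketched in plain Java as below. This is only an illustration under stated assumptions: Node is a hypothetical minimal tree type, not Stanford's Tree class, and conjuncts is a made-up helper that treats any node with a CC or "," child as a coordination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: Node is a hypothetical stand-in for a parse tree node.
class ConjunctSplitDemo {

    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Returns, in pre-order, every node with the given category whose parent
    // also has a CC or "," child.
    static List<Node> conjuncts(Node root, String category) {
        List<Node> found = new ArrayList<>();
        collect(root, category, found);
        return found;
    }

    private static void collect(Node node, String category, List<Node> found) {
        // First pass: does this node coordinate its children?
        boolean coordinated = false;
        for (Node child : node.children) {
            if (child.label.equals("CC") || child.label.equals(",")) {
                coordinated = true;
            }
        }
        // Second pass: keep matching children, then recurse into each child.
        for (Node child : node.children) {
            if (coordinated && child.label.equals(category)) {
                found.add(child);
            }
            collect(child, category, found);
        }
    }

    public static void main(String[] args) {
        // (VP (VP (VB manage)) (, ,) (VP (VB build)) (CC and) (VP (VB expose)))
        Node tree = new Node("VP",
                new Node("VP", new Node("VB", new Node("manage"))),
                new Node(",", new Node(",")),
                new Node("VP", new Node("VB", new Node("build"))),
                new Node("CC", new Node("and")),
                new Node("VP", new Node("VB", new Node("expose"))));
        for (Node vp : conjuncts(tree, "VP")) {
            // Print the verb under each extracted VP.
            System.out.println(vp.label + " headed by " + vp.children.get(0).children.get(0).label);
        }
    }
}
```

Running main on this three-way coordination lists the VPs for manage, build, and expose in that order. The check is deliberately loose: it would also pick up a VP that merely shares a parent with a comma, so a real grammar may need stricter conditions.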

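What do you mean by "doesn't work"? What output or exception do you get? What output do you expect? Also, why are you compiling the pattern for every subtree?

@dantiston I repeat it over all subtrees because I expect to get every VP or NP that is joined by a conjunction. The problem is that the code doesn't find any matches. Is there something wrong with the pattern?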