Stanford NLP: how to add POS tags to a parsed tree that has none?


For example, take a parse tree from the Stanford Sentiment Treebank:

(2 (2 … (2 ending)) (3 (3 (2 takes) (2 … (2 other) (2 means) …))))

where each number is the sentiment label of the corresponding node.

I want to add part-of-speech tag information to every node, for example:

(NP (DT the) (NN ending))

I tried parsing the sentences directly, but the resulting trees differ from the ones in the Sentiment Treebank (perhaps because of the parser version or parameters; I tried contacting the authors but got no response).


How can I get the tag information?

I think the code in edu.stanford.nlp.sentiment.BuildBinarizedDataset should help. Its main() method walks through how these binarized trees are created in Java code.

Some key lines to note in the code:

// load the lexicalized parser model
LexicalizedParser parser = LexicalizedParser.loadModel(parserModel);
// build a binarizer from the parser's head finder and language pack
TreeBinarizer binarizer = TreeBinarizer.simpleTreeBinarizer(parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
...
Tree tree = parser.apply(tokens);                // parse the token list
Tree binarized = binarizer.transformTree(tree);  // binarize the parse tree
You can access the node label information from the Tree object. You should look at the javadoc for edu.stanford.nlp.trees.Tree to see how to access this information.

In this answer I also have some code that shows how to access the trees:

You need to look at the label() of each tree and subtree to get the tag of a node.
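
For instance, here is a minimal sketch (the bracketed tree string is a made-up stand-in for a real Sentiment Treebank entry, not taken from the original answer) that reads a tree and prints the label of every node:

import edu.stanford.nlp.trees.Tree;

public class PrintLabels {
  public static void main(String[] args) {
    // Tree.valueOf reads a Penn-bracketed tree string
    Tree tree = Tree.valueOf("(2 (2 the) (2 ending))");
    // Tree implements Iterable<Tree>, visiting every subtree in preorder
    for (Tree node : tree) {
      System.out.println(node.label().value());
    }
  }
}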

For reference, here is BuildBinarizedDataset.java on GitHub:


Let me know if anything is unclear, and I can provide further help.

First, you need to download the Stanford Parser.

Setup:

private LexicalizedParser parser;
private TreeBinarizer binarizer;
private CollapseUnaryTransformer transformer;

parser = LexicalizedParser.loadModel(PCFG_PATH);
binarizer = TreeBinarizer.simpleTreeBinarizer(
      parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
transformer = new CollapseUnaryTransformer();
Parsing:
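
The parsing step itself is just a call to the loaded parser; this mirrors the parse() method in the complete source further down:

public Tree parse(List<HasWord> tokens) {
  Tree tree = parser.apply(tokens);  // run the lexicalized parser over the token list
  return tree;
}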

Accessing the POS tags:

public String[] constTreePOSTAG(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    String[] tags = new String[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          tags[curIdx] = cur.label().toString();
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        tags[curIdx] = cur.label().toString(); // record this node's own label (the POS tag for preterminals)
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return tags;
  }
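
As a quick usage sketch (hypothetical driver code, not part of the original answer; it assumes the parser, binarizer and transformer fields from the setup above have been initialized):

// hypothetical driver for constTreePOSTAG
List<HasWord> tokens = new ArrayList<>();
for (String w : "the ending".split(" ")) {
  tokens.add(new Word(w));                  // wrap each whitespace-split token
}
Tree tree = parser.apply(tokens);           // constituency parse
String[] tags = constTreePOSTAG(tree);      // one label per non-leaf node
System.out.println(String.join(" ", tags));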
Here is the complete source code of ConstituencyParse.java. Run it with arguments like:

java ConstituencyParse -tokpath outputtoken.toks -parentpath outputparent.txt -tagpath outputtag.txt < input_sentences.txt

where the input text file contains one sentence per line.

(Note: the source code is adapted from another project; you also need to replace the calling code so that it invokes the ConstituencyParse.java file below.)

import edu.stanford.nlp.process.WordTokenFactory;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.util.StringUtils;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.TreeBinarizer;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Trees;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.StringReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.HashMap;
import java.util.Properties;
import java.util.Scanner;

public class ConstituencyParse {

  private boolean tokenize;
  private BufferedWriter tokWriter, parentWriter, tagWriter;
  private LexicalizedParser parser;
  private TreeBinarizer binarizer;
  private CollapseUnaryTransformer transformer;
  private GrammaticalStructureFactory gsf;

  private static final String PCFG_PATH = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";

  public ConstituencyParse(String tokPath, String parentPath, String tagPath, boolean tokenize) throws IOException {
    this.tokenize = tokenize;
    if (tokPath != null) {
      tokWriter = new BufferedWriter(new FileWriter(tokPath));
    }
    parentWriter = new BufferedWriter(new FileWriter(parentPath));
    tagWriter = new BufferedWriter(new FileWriter(tagPath));
    parser = LexicalizedParser.loadModel(PCFG_PATH);
    binarizer = TreeBinarizer.simpleTreeBinarizer(
      parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
    transformer = new CollapseUnaryTransformer();


    // set up to produce dependency representations from constituency trees
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    gsf = tlp.grammaticalStructureFactory();
  }

  public List<HasWord> sentenceToTokens(String line) {
    List<HasWord> tokens = new ArrayList<>();
    if (tokenize) {
      PTBTokenizer<Word> tokenizer = new PTBTokenizer<>(new StringReader(line), new WordTokenFactory(), "");
      for (Word label; tokenizer.hasNext(); ) {
        tokens.add(tokenizer.next());
      }
    } else {
      for (String word : line.split(" ")) {
        tokens.add(new Word(word));
      }
    }

    return tokens;
  }

  public Tree parse(List<HasWord> tokens) {
    Tree tree = parser.apply(tokens);
    return tree;
  }

  public String[] constTreePOSTAG(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    String[] tags = new String[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          tags[curIdx] = cur.label().toString();
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        tags[curIdx] = cur.label().toString(); // record this node's own label (the POS tag for preterminals)
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return tags;
  }

  // produce the parent-pointer representation of the binarized tree:
  // parents[i] is the 1-based index of node i's parent; preterminals occupy
  // indices 0..n-1 (one per token), internal nodes are numbered after the
  // leaves, and 0 marks the root
  public int[] constTreeParents(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    int[] parents = new int[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          parents[curIdx] = 0;
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        parents[curIdx] = parentIdx + 1;
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return parents;
  }

  // convert constituency parse to a dependency representation and return the
  // parent pointer representation of the tree
  public int[] depTreeParents(Tree tree, List<HasWord> tokens) {
    GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
    Collection<TypedDependency> tdl = gs.typedDependencies();
    int len = tokens.size();
    int[] parents = new int[len];
    for (int i = 0; i < len; i++) {
      // if a node has a parent of -1 at the end of parsing, then the node
      // has no parent.
      parents[i] = -1;
    }

    for (TypedDependency td : tdl) {
      // token indices are 1-based; the artificial ROOT node has index 0
      int child = td.dep().index();
      int parent = td.gov().index();
      parents[child - 1] = parent;
    }

    return parents;
  }

  public void printTokens(List<HasWord> tokens) throws IOException {
    int len = tokens.size();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len - 1; i++) {
      if (tokenize) {
        sb.append(PTBTokenizer.ptbToken2Text(tokens.get(i).word()));
      } else {
        sb.append(tokens.get(i).word());
      }
      sb.append(' ');
    }

    if (tokenize) {
      sb.append(PTBTokenizer.ptbToken2Text(tokens.get(len - 1).word()));
    } else {
      sb.append(tokens.get(len - 1).word());
    }

    sb.append('\n');
    tokWriter.write(sb.toString());
  }

  public void printParents(int[] parents) throws IOException {
    StringBuilder sb = new StringBuilder();
    int size = parents.length;
    for (int i = 0; i < size - 1; i++) {
      sb.append(parents[i]);
      sb.append(' ');
    }
    sb.append(parents[size - 1]);
    sb.append('\n');
    parentWriter.write(sb.toString());
  }

  public void printTags(String[] tags) throws IOException {
    StringBuilder sb = new StringBuilder();
    int size = tags.length;
    for (int i = 0; i < size - 1; i++) {
      sb.append(tags[i]);
      sb.append(' ');
    }
    sb.append(tags[size - 1]);
    sb.append('\n');
    tagWriter.write(sb.toString().toLowerCase());
  }

  public void close() throws IOException {
    if (tokWriter != null) tokWriter.close();
    parentWriter.close();
    tagWriter.close();
  }

  public static void main(String[] args) throws Exception {
    Properties props = StringUtils.argsToProperties(args);
    if (!props.containsKey("parentpath")) {
      System.err.println(
        "usage: java ConstituencyParse [-deps] [-tokenize] [-tokpath <tokpath>] -parentpath <parentpath> -tagpath <tagpath>");
      System.exit(1);
    }

    // whether to tokenize input sentences
    boolean tokenize = false;
    if (props.containsKey("tokenize")) {
      tokenize = true;
    }

    // whether to produce dependency trees from the constituency parse
    boolean deps = false;
    if (props.containsKey("deps")) {
      deps = true;
    }

    String tokPath = props.containsKey("tokpath") ? props.getProperty("tokpath") : null;
    String parentPath = props.getProperty("parentpath");
    String tagPath = props.getProperty("tagpath");

    ConstituencyParse processor = new ConstituencyParse(tokPath, parentPath, tagPath, tokenize);

    Scanner stdin = new Scanner(System.in);
    int count = 0;
    long start = System.currentTimeMillis();
    while (stdin.hasNextLine()) {
      String line = stdin.nextLine();
      List<HasWord> tokens = processor.sentenceToTokens(line);

      Tree parse = processor.parse(tokens);

      // produce parent pointer representation
      int[] parents = deps ? processor.depTreeParents(parse, tokens)
                           : processor.constTreeParents(parse);

      String[] tags = processor.constTreePOSTAG(parse);

      // print
      if (tokPath != null) {
        processor.printTokens(tokens);
      }
      processor.printParents(parents);
      processor.printTags(tags);

      count++;
      if (count % 100 == 0) {
        double elapsed = (System.currentTimeMillis() - start) / 1000.0;
        System.err.printf("Parsed %d lines (%.2fs)\n", count, elapsed);
      }
    }

    long totalTimeMillis = System.currentTimeMillis() - start;
    System.err.printf("Done: %d lines in %.2fs (%.1fms per line)\n",
      count, totalTimeMillis / 1000.0, totalTimeMillis / (double) count);
    processor.close();
  }
}
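
For each input sentence, the program writes one line of space-separated parent indices to the -parentpath file (0 marks the root), one line of lower-cased node labels to the -tagpath file, and, if -tokpath is given, the tokens themselves.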