Stanford NLP: how to add POS tags to a parsed tree that has none?


For example, take a parse tree from the Stanford Sentiment Treebank:

(2 (2 … (2 ending)) (3 (3 (2 takes) (2 … (2 other) (2 means) …))))

where each number is the sentiment label of the corresponding node.

I want to add part-of-speech tag information to every node, for example:

(NP (DT the) (NN ending))

I tried parsing the sentences directly, but the resulting trees differ from the ones in the Sentiment Treebank (perhaps because of the parser version or parameters; I tried contacting the authors but got no response).


How can I get the tag information?

I think the code in edu.stanford.nlp.sentiment.BuildBinarizedDataset should help. Its main() method walks through how these binarized trees are created in Java code.

Some key lines to note in the code:

// load the lexicalized parser model
LexicalizedParser parser = LexicalizedParser.loadModel(parserModel);
// build a binarizer from the parser's head finder and language pack
TreeBinarizer binarizer = TreeBinarizer.simpleTreeBinarizer(parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
...
Tree tree = parser.apply(tokens);                // parse the token list
Tree binarized = binarizer.transformTree(tree);  // binarize the parse tree
You can access the node label information from the Tree object. You should look at the javadoc for edu.stanford.nlp.trees.Tree to see how to access this information.

In this answer I also have some code that shows how to access the trees:

You need to look at the label() of each tree and subtree to get the tag of a node.
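
For instance, here is a minimal sketch (the bracketed tree string is a made-up stand-in for a real Sentiment Treebank entry, not taken from the original answer) that reads a tree and prints the label of every node:

import edu.stanford.nlp.trees.Tree;

public class PrintLabels {
  public static void main(String[] args) {
    // Tree.valueOf reads a Penn-bracketed tree string
    Tree tree = Tree.valueOf("(2 (2 the) (2 ending))");
    // Tree implements Iterable<Tree>, visiting every subtree in preorder
    for (Tree node : tree) {
      System.out.println(node.label().value());
    }
  }
}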

For reference, here is BuildBinarizedDataset.java on GitHub:


Let me know if anything is unclear, and I can provide further help.

First, you need to download the Stanford Parser.

Setup:

private LexicalizedParser parser;
private TreeBinarizer binarizer;
private CollapseUnaryTransformer transformer;

parser = LexicalizedParser.loadModel(PCFG_PATH);
binarizer = TreeBinarizer.simpleTreeBinarizer(
      parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
transformer = new CollapseUnaryTransformer();
Parsing:
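
The parsing step itself is just a call to the loaded parser; this mirrors the parse() method in the complete source further down:

public Tree parse(List<HasWord> tokens) {
  Tree tree = parser.apply(tokens);  // run the lexicalized parser over the token list
  return tree;
}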

Accessing the POS tags:

public String[] constTreePOSTAG(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    String[] tags = new String[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          tags[curIdx] = cur.label().toString();
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        tags[curIdx] = cur.label().toString(); // record this node's own label (the POS tag for preterminals)
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return tags;
  }
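
As a quick usage sketch (hypothetical driver code, not part of the original answer; it assumes the parser, binarizer and transformer fields from the setup above have been initialized):

// hypothetical driver for constTreePOSTAG
List<HasWord> tokens = new ArrayList<>();
for (String w : "the ending".split(" ")) {
  tokens.add(new Word(w));                  // wrap each whitespace-split token
}
Tree tree = parser.apply(tokens);           // constituency parse
String[] tags = constTreePOSTAG(tree);      // one label per non-leaf node
System.out.println(String.join(" ", tags));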
Here is the complete source code of ConstituencyParse.java. Run it with arguments like:

java ConstituencyParse -tokpath outputtoken.toks -parentpath outputparent.txt -tagpath outputtag.txt < input_sentences.txt

where the input text file contains one sentence per line.

(Note: the source code is adapted from another project; you also need to replace the calling code so that it invokes the ConstituencyParse.java file below.)

import edu.stanford.nlp.process.WordTokenFactory;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.util.StringUtils;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.TreeBinarizer;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Trees;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.StringReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.HashMap;
import java.util.Properties;
import java.util.Scanner;

public class ConstituencyParse {

  private boolean tokenize;
  private BufferedWriter tokWriter, parentWriter, tagWriter;
  private LexicalizedParser parser;
  private TreeBinarizer binarizer;
  private CollapseUnaryTransformer transformer;
  private GrammaticalStructureFactory gsf;

  private static final String PCFG_PATH = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";

  public ConstituencyParse(String tokPath, String parentPath, String tagPath, boolean tokenize) throws IOException {
    this.tokenize = tokenize;
    if (tokPath != null) {
      tokWriter = new BufferedWriter(new FileWriter(tokPath));
    }
    parentWriter = new BufferedWriter(new FileWriter(parentPath));
    tagWriter = new BufferedWriter(new FileWriter(tagPath));
    parser = LexicalizedParser.loadModel(PCFG_PATH);
    binarizer = TreeBinarizer.simpleTreeBinarizer(
      parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
    transformer = new CollapseUnaryTransformer();


    // set up to produce dependency representations from constituency trees
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    gsf = tlp.grammaticalStructureFactory();
  }

  public List<HasWord> sentenceToTokens(String line) {
    List<HasWord> tokens = new ArrayList<>();
    if (tokenize) {
      PTBTokenizer<Word> tokenizer = new PTBTokenizer<>(new StringReader(line), new WordTokenFactory(), "");
      for (Word label; tokenizer.hasNext(); ) {
        tokens.add(tokenizer.next());
      }
    } else {
      for (String word : line.split(" ")) {
        tokens.add(new Word(word));
      }
    }

    return tokens;
  }

  public Tree parse(List<HasWord> tokens) {
    Tree tree = parser.apply(tokens);
    return tree;
  }

  public String[] constTreePOSTAG(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    String[] tags = new String[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          tags[curIdx] = cur.label().toString();
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        tags[curIdx] = cur.label().toString(); // record this node's own label (the POS tag for preterminals)
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return tags;
  }

  // produce the parent-pointer representation of the binarized tree:
  // parents[i] is the 1-based index of node i's parent; preterminals occupy
  // indices 0..n-1 (one per token), internal nodes are numbered after the
  // leaves, and 0 marks the root
  public int[] constTreeParents(Tree tree) {
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    List<Tree> leaves = collapsedUnary.getLeaves();
    int size = collapsedUnary.size() - leaves.size();
    int[] parents = new int[size];
    HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();

    int idx = leaves.size();
    int leafIdx = 0;
    for (Tree leaf : leaves) {
      Tree cur = leaf.parent(collapsedUnary); // go to preterminal
      int curIdx = leafIdx++;
      boolean done = false;
      while (!done) {
        Tree parent = cur.parent(collapsedUnary);
        if (parent == null) {
          parents[curIdx] = 0;
          break;
        }

        int parentIdx;
        int parentNumber = parent.nodeNumber(collapsedUnary);
        if (!index.containsKey(parentNumber)) {
          parentIdx = idx++;
          index.put(parentNumber, parentIdx);
        } else {
          parentIdx = index.get(parentNumber);
          done = true;
        }

        parents[curIdx] = parentIdx + 1;
        cur = parent;
        curIdx = parentIdx;
      }
    }

    return parents;
  }

  // convert constituency parse to a dependency representation and return the
  // parent pointer representation of the tree
  public int[] depTreeParents(Tree tree, List<HasWord> tokens) {
    GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
    Collection<TypedDependency> tdl = gs.typedDependencies();
    int len = tokens.size();
    int[] parents = new int[len];
    for (int i = 0; i < len; i++) {
      // if a node has a parent of -1 at the end of parsing, then the node
      // has no parent.
      parents[i] = -1;
    }

    for (TypedDependency td : tdl) {
      // token indices are 1-based; the artificial ROOT node has index 0
      int child = td.dep().index();
      int parent = td.gov().index();
      parents[child - 1] = parent;
    }

    return parents;
  }

  public void printTokens(List<HasWord> tokens) throws IOException {
    int len = tokens.size();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len - 1; i++) {
      if (tokenize) {
        sb.append(PTBTokenizer.ptbToken2Text(tokens.get(i).word()));
      } else {
        sb.append(tokens.get(i).word());
      }
      sb.append(' ');
    }

    if (tokenize) {
      sb.append(PTBTokenizer.ptbToken2Text(tokens.get(len - 1).word()));
    } else {
      sb.append(tokens.get(len - 1).word());
    }

    sb.append('\n');
    tokWriter.write(sb.toString());
  }

  public void printParents(int[] parents) throws IOException {
    StringBuilder sb = new StringBuilder();
    int size = parents.length;
    for (int i = 0; i < size - 1; i++) {
      sb.append(parents[i]);
      sb.append(' ');
    }
    sb.append(parents[size - 1]);
    sb.append('\n');
    parentWriter.write(sb.toString());
  }

  public void printTags(String[] tags) throws IOException {
    StringBuilder sb = new StringBuilder();
    int size = tags.length;
    for (int i = 0; i < size - 1; i++) {
      sb.append(tags[i]);
      sb.append(' ');
    }
    sb.append(tags[size - 1]);
    sb.append('\n');
    tagWriter.write(sb.toString().toLowerCase());
  }

  public void close() throws IOException {
    if (tokWriter != null) tokWriter.close();
    parentWriter.close();
    tagWriter.close();
  }

  public static void main(String[] args) throws Exception {
    Properties props = StringUtils.argsToProperties(args);
    if (!props.containsKey("parentpath")) {
      System.err.println(
        "usage: java ConstituencyParse [-deps] [-tokenize] [-tokpath <tokpath>] -parentpath <parentpath> -tagpath <tagpath>");
      System.exit(1);
    }

    // whether to tokenize input sentences
    boolean tokenize = false;
    if (props.containsKey("tokenize")) {
      tokenize = true;
    }

    // whether to produce dependency trees from the constituency parse
    boolean deps = false;
    if (props.containsKey("deps")) {
      deps = true;
    }

    String tokPath = props.containsKey("tokpath") ? props.getProperty("tokpath") : null;
    String parentPath = props.getProperty("parentpath");
    String tagPath = props.getProperty("tagpath");

    ConstituencyParse processor = new ConstituencyParse(tokPath, parentPath, tagPath, tokenize);

    Scanner stdin = new Scanner(System.in);
    int count = 0;
    long start = System.currentTimeMillis();
    while (stdin.hasNextLine()) {
      String line = stdin.nextLine();
      List<HasWord> tokens = processor.sentenceToTokens(line);

      Tree parse = processor.parse(tokens);

      // produce parent pointer representation
      int[] parents = deps ? processor.depTreeParents(parse, tokens)
                           : processor.constTreeParents(parse);

      String[] tags = processor.constTreePOSTAG(parse);

      // print
      if (tokPath != null) {
        processor.printTokens(tokens);
      }
      processor.printParents(parents);
      processor.printTags(tags);

      count++;
      if (count % 100 == 0) {
        double elapsed = (System.currentTimeMillis() - start) / 1000.0;
        System.err.printf("Parsed %d lines (%.2fs)\n", count, elapsed);
      }
    }

    long totalTimeMillis = System.currentTimeMillis() - start;
    System.err.printf("Done: %d lines in %.2fs (%.1fms per line)\n",
      count, totalTimeMillis / 1000.0, totalTimeMillis / (double) count);
    processor.close();
  }
}
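
For each input sentence, the program writes one line of space-separated parent indices to the -parentpath file (0 marks the root), one line of lower-cased node labels to the -tagpath file, and, if -tokpath is given, the tokens themselves.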