Python: encoding the structure of a parse tree

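I am working with a dataset and trying to understand two of its files, STree.txt and SOStr.txt, which encode the parse tree of each sentence.

For example, how can I decode this parse tree?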

 Effective|but|too-tepid|biopic

 6|6|5|5|7|7|0
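The README says:

  • SOStr.txt and STree.txt encode the structure of the parse trees. STree encodes the trees in a parent pointer format. Each line corresponds to each sentence in the datasetSentences.txt file.
  • Is there a parser that converts a sentence into this format? How can I decode this parse tree?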

     Effective|but|too-tepid|biopic
    
     6|6|5|5|7|7|0
    
    I used this Python script to print the constituency tree of the sentence above:

     # Python 3. load_constituency_tree is not defined in this snippet; it must be
     # supplied separately and return nodes that expose .left and .right.
     def describe(node):
         """Format every attribute of a tree node as 'name: value' pairs."""
         return ', '.join('%s: %s' % item for item in vars(node).items())

     with open('parents.txt') as parentsfile, open('sents.txt') as toksfile:
         # One sentence per line: parent pointers and tokens, whitespace-separated.
         parents = [list(map(int, line.split())) for line in parentsfile]
         toks = [line.strip().split() for line in toksfile]

     const_trees = []
     for i in range(len(toks)):
         const_trees.append(load_constituency_tree(parents[i], toks[i]))
         root = const_trees[i]
         # Print the root first, then its children and grandchildren.
         for node in (root, root.right, root.left,
                      root.right.right, root.right.left,
                      root.left.left, root.left.right):
             print(describe(node))
         break  # only inspect the first sentence
    
    From the output I worked out that the tree for the first sentence looks like this:

                                  6
                                  |
                    +-------------+------------+
                    |                          |
                    5                          4
          +---------+---------+      +---------+---------+
          |                   |      |                   |
      Effective              but  too-tepid            biopic
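The drawing can be cross-checked against the parent pointers directly. The following is only a minimal sketch: `Node`, `build_tree`, `leftmost` and `bracketed` are hypothetical names, not part of the dataset tooling, and it assumes that the first entries of the pointer list belong to the tokens in sentence order and that pointers are 1-based with 0 marking the root. Nodes are labelled with 0-based indices, which is why the root shows up as 6 rather than 7.

```python
class Node:
    """Tree node; idx is the 0-based node number, word is set for tokens only."""
    def __init__(self, idx, word=None):
        self.idx = idx
        self.word = word
        self.children = []

def build_tree(tokens, parents):
    """Build a tree from 1-based parent pointers (0 marks the root)."""
    # Assumption: the first len(tokens) nodes are the tokens, in sentence order.
    nodes = [Node(i, tokens[i] if i < len(tokens) else None)
             for i in range(len(parents))]
    root = None
    for i, p in enumerate(parents):
        if p == 0:
            root = nodes[i]
        else:
            nodes[p - 1].children.append(nodes[i])
    return root

def leftmost(node):
    """0-based index of the first token below node, used to order children."""
    if node.word is not None:
        return node.idx
    return min(leftmost(child) for child in node.children)

def bracketed(node):
    """Render the tree as a bracketed string with 0-based node labels."""
    if node.word is not None:
        return node.word
    kids = sorted(node.children, key=leftmost)
    return "(%d %s)" % (node.idx, " ".join(bracketed(k) for k in kids))

tokens = "Effective|but|too-tepid|biopic".split("|")
parents = [int(p) for p in "6|6|5|5|7|7|0".split("|")]
root = build_tree(tokens, parents)
print(bracketed(root))  # (6 (5 Effective but) (4 too-tepid biopic))
```

Sorting children by their leftmost token keeps the output in sentence order even when the pointer list names the right subtree first.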
    
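    As described in this paper, the non-terminals are phrase types, but in this representation of the tree they are indices, possibly indices into a dictionary of phrase types. My question is: where is that dictionary? How can I turn these ints into a set of phrases?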
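If the numbers are read as parent pointers, as the README describes, they identify nodes rather than phrase types, and the phrase under each node can be recovered from the pointer list alone. A minimal sketch under that reading follows; `node_phrases` is a hypothetical helper, and it assumes that nodes 1..n are the tokens in sentence order.

```python
def node_phrases(tokens, parents):
    """Map each 1-based node number to the phrase (token span) it covers."""
    n = len(tokens)
    # Assumption: nodes 1..n are the tokens; higher numbers are internal nodes.
    covered = {i + 1: [i] for i in range(n)}
    for leaf in range(1, n + 1):
        # Walk from the token up to the root, crediting it to every ancestor.
        node = parents[leaf - 1]
        while node != 0:
            covered.setdefault(node, []).append(leaf - 1)
            node = parents[node - 1]
    return {node: " ".join(tokens[i] for i in sorted(idxs))
            for node, idxs in covered.items()}

tokens = "Effective|but|too-tepid|biopic".split("|")
parents = [int(p) for p in "6|6|5|5|7|7|0".split("|")]
for node, phrase in sorted(node_phrases(tokens, parents).items()):
    print(node, phrase)
# 1 Effective
# 2 but
# 3 too-tepid
# 4 biopic
# 5 too-tepid biopic
# 6 Effective but
# 7 Effective but too-tepid biopic
```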

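    My solution: I am not sure this is the right solution, but I use this function to convert a tree into the corresponding list of parent pointers: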

    # Given the list returned by ptree.treepositions('postorder') from the nltk library, i.e.
    # a list of tuples like this:
    # [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0), (1, 1, 1), (1, 1), (1,), ()]
    # which describes the structure of a tree, where each list index is the index of a node in postorder,
    # return the list of parents for each node, i.e. [2, 9, 4, 8, 7, 7, 8, 9, 0], where 0 marks the root.
    # The list above describes the structure of this tree:
    #             S
    #  ___________|___
    # |               VP
    # |      _________|___
    # NP    V             NP
    # |     |          ___|____
    # I  enjoyed      my     cookie
    
    
    def make_parents_list(treepositions):
        parents = []
        for i, pos in enumerate(treepositions):
            if len(pos) == 0:
                # The empty position is the root.
                parents.append(0)
            else:
                # In postorder, the parent is the first later node that is one level shallower.
                parent = next(j + 1 for j in range(i + 1, len(treepositions))
                              if len(treepositions[j]) == len(pos) - 1)
                parents.append(parent)
        return parents
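The same parent list can also be obtained by a direct lookup, since a node's parent sits at the same tree position with the last step removed. This is a sketch of that alternative and not the function above; `parents_by_prefix` is a hypothetical name, and the input is the tuple list from the comment, so NLTK is not needed to run it.

```python
def parents_by_prefix(treepositions):
    """Parent pointers (1-based, 0 for the root) from postorder tree positions."""
    # Number each position by its place in the postorder list.
    number = {pos: i + 1 for i, pos in enumerate(treepositions)}
    # A node's parent is found at the same position minus its last step.
    return [number[pos[:-1]] if pos else 0 for pos in treepositions]

positions = [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0),
             (1, 1, 1), (1, 1), (1,), ()]
print(parents_by_prefix(positions))  # [2, 9, 4, 8, 7, 7, 8, 9, 0]
```

Both versions number the nodes in postorder. In the STree.txt example from the question the tokens appear to be numbered first (1 to 4) and the internal nodes after them, so the resulting lists may not match that file's numbering without renumbering.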