Python: converting an nltk tree to a JSON representation


I want to convert the following nltk tree representation to JSON format:

Desired output:

{
    "scores": {
        "filler": [
            [
                "scores"
            ],
            [
                "for"
            ]
        ],
        "extent": [
            "highest"
        ],
        "team": [
            "India"
        ]
    }
}

Convert the tree to a dict, then convert the dict to JSON:

import json
import nltk

def tree_to_dict(tree):
    # Map each child's label to a nested dict, or to its first leaf.
    tdict = {}
    for t in tree:
        if isinstance(t, nltk.Tree) and isinstance(t[0], nltk.Tree):
            tdict[t.label()] = tree_to_dict(t)
        elif isinstance(t, nltk.Tree):
            tdict[t.label()] = t[0]
    return tdict

def dict_to_json(d):
    return json.dumps(d)

# `tree` is the nltk.Tree from the question.
output_json = dict_to_json({tree.label(): tree_to_dict(tree)})
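
Because this approach keys each dict by node label, sibling subtrees that share a label overwrite one another. A minimal sketch of that effect, assuming a recent NLTK in which `Tree.label()` replaces the older `.node` attribute; the sample tree is hand-built for illustration and is not necessarily the asker's input:

```python
import json
from nltk import Tree

def tree_to_dict(tree):
    # Same logic as above: one dict entry per child label.
    tdict = {}
    for t in tree:
        if isinstance(t, Tree) and isinstance(t[0], Tree):
            tdict[t.label()] = tree_to_dict(t)
        elif isinstance(t, Tree):
            tdict[t.label()] = t[0]
    return tdict

# Hypothetical sample with two sibling 'filler' subtrees.
sample = Tree('scores',
              [Tree('extent', ['highest']),
               Tree('filler',
                    [Tree('filler', ['scores']),
                     Tree('filler', ['for'])]),
               Tree('team', ['India'])])

result = {sample.label(): tree_to_dict(sample)}
print(json.dumps(result))
# {"scores": {"extent": "highest", "filler": {"filler": "for"}, "team": "India"}}
```

The inner 'scores' leaf is lost because the second 'filler' entry replaces the first. If every child has to survive, a list-based mapping is the safer choice.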

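It looks like the input tree may contain children with the same name. To support the general case, you could convert each Tree into a dictionary that maps its name to a list of its children: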

from nltk import Tree  # $ pip install nltk

def tree2dict(tree):
    # Map the node's label to a list of its children, recursing into subtrees.
    return {tree.label(): [tree2dict(t) if isinstance(t, Tree) else t
                           for t in tree]}
Example:

import json
import sys

tree = Tree('scores',
            [Tree('extent', ['highest']),
             Tree('filler',
                  [Tree('filler', ['scores']),
                   Tree('filler', ['for'])]),
             Tree('team', ['India'])])
d = tree2dict(tree)
json.dump(d, sys.stdout, indent=2)
Output:

{
  "scores": [
    {
      "extent": [
        "highest"
      ]
    }, 
    {
      "filler": [
        {
          "filler": [
            "scores"
          ]
        }, 
        {
          "filler": [
            "for"
          ]
        }
      ]
    }, 
    {
      "team": [
        "India"
      ]
    }
  ]
}

Convert the tree to a dictionary keyed by the tree labels; you can then easily convert it to JSON with json.dumps.

    import nltk

    def tree_to_dict(tree):
        # Assumes every leaf is a (word, tag) tuple, as produced by ne_chunk.
        tree_dict = dict()
        leaves = []
        for subtree in tree:
            if isinstance(subtree, nltk.Tree):
                # update() overwrites an earlier subtree with the same label.
                tree_dict.update(tree_to_dict(subtree))
            else:
                expression, tag = subtree
                leaves.append(expression)
        tree_dict[tree.label()] = " ".join(leaves)

        return tree_dict
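
As a usage sketch, the function can be run on a hand-built tree whose leaves are (word, tag) tuples, which is the shape `ne_chunk` produces. The sample tree and its tags are made up for illustration, so no NLTK data download is needed:

```python
import json
from nltk import Tree

def tree_to_dict(tree):
    # Label -> space-joined words found directly under that node.
    tree_dict = dict()
    leaves = []
    for subtree in tree:
        if isinstance(subtree, Tree):
            tree_dict.update(tree_to_dict(subtree))
        else:
            expression, tag = subtree
            leaves.append(expression)
    tree_dict[tree.label()] = " ".join(leaves)
    return tree_dict

# Hypothetical chunked form of "Tom and Larry play for the Patriots."
sample = Tree('S', [
    Tree('PERSON', [('Tom', 'NNP')]),
    ('and', 'CC'),
    Tree('PERSON', [('Larry', 'NNP')]),
    ('play', 'VBP'), ('for', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('Patriots', 'NNPS')]),
    ('.', '.'),
])

result = tree_to_dict(sample)
print(json.dumps(result))
# {"PERSON": "Larry", "ORGANIZATION": "Patriots", "S": "and play for the ."}
```

Only "Larry" is kept under PERSON, because `dict.update` replaces the earlier "Tom" entry. The "S" key collects the words that are not inside any labeled subtree.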

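A related alternative. For my purposes I didn't need to preserve the exact tree; instead I wanted to extract the entities as keys and the tokens as lists of values. For the sentence "Tom and Larry play for the Patriots.", I wanted the following: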

{
  "PERSON": [
    "Tom",
    "Larry"
  ],
  "ORGANIZATION": [
    "Patriots"
  ]
}
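This preserves the order of the tokens (per entity type) and doesn't "stomp" on a value already set for an entity key. You can reuse the same json.dump code from the other answers to turn this dict into JSON.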

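import nltk  # needed for nltk.Tree below
from nltk import tag, chunk, tokenize
# Note: the tokenizer, POS tagger, and NE chunker each need their NLTK data downloaded first.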

def prep(sentence):
    return chunk.ne_chunk(tag.pos_tag(tokenize.word_tokenize(sentence)))

t = prep("Tom and Larry play for the Patriots.")

def tree_to_dict(tree):
    tree_dict = dict()
    for st in tree:
        # not everything gets a NE tag,
        # so we can ignore untagged tokens
        # which are stored in tuples
        if isinstance(st, nltk.Tree):
            if st.label() in tree_dict:
                tree_dict[st.label()] = tree_dict[st.label()] + [st[0][0]]
            else:
                tree_dict[st.label()] = [st[0][0]]
    return tree_dict

print(tree_to_dict(t))
# {'PERSON': ['Tom', 'Larry'], 'ORGANIZATION': ['Patriots']}
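
One limitation of the function above is that `st[0][0]` keeps only the first token of each chunk, so a two-token entity such as "Tom Brady" would be stored as "Tom". A possible variant, sketched under the assumption that multi-token entities should be joined with spaces; `entities_to_dict` and the hand-built sample tree are illustrative names, not part of NLTK:

```python
from collections import defaultdict
from nltk import Tree

def entities_to_dict(tree):
    """Group entity text by label, joining every token of a chunk."""
    grouped = defaultdict(list)
    for st in tree:
        # Untagged tokens are plain (word, tag) tuples and are skipped.
        if isinstance(st, Tree):
            # leaves() returns the chunk's (word, tag) tuples in order.
            words = [word for word, _tag in st.leaves()]
            grouped[st.label()].append(" ".join(words))
    return dict(grouped)

# Hypothetical chunked form of "Tom Brady plays for the Patriots."
sample = Tree('S', [
    Tree('PERSON', [('Tom', 'NNP'), ('Brady', 'NNP')]),
    ('plays', 'VBZ'), ('for', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('Patriots', 'NNPS')]),
    ('.', '.'),
])

entities = entities_to_dict(sample)
print(entities)
# {'PERSON': ['Tom Brady'], 'ORGANIZATION': ['Patriots']}
```

Using `defaultdict(list)` also removes the need for the explicit "is the key already present" branch.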

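Comments:

It's not valid JSON: there are two "team" names in the same object. A JSON object is an unordered set of name/value pairs. Different JSON parsers may produce different results: a parser may keep only the first "team" pair or only the last one, or (less likely) [...]. See also: "When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates."

Likewise, the source tree contains duplicate names ('filler', 'filler'). Why do you drop them from the output?

They are dropped automatically when the dict is generated. Dropping them is fine, because the filler information isn't needed in the output.

How do you know the filler information isn't needed in the output? Convert the tree to a dict and use json.dump(result_dict, sys.stdout, indent=2) instead of generating the JSON text by hand.

Thanks. I'll look into it again.

@J.F.Sebastian How do I convert the tree to a dict? Which method should I use?

t.node must now be changed to t.label().

For the sentence "Tom Brady plays for the Patriots", the output is: {'ORGANIZATION': ('Patriots', 'NNP'), 'PERSON': ('Brady', 'NNP')}

For comparison, this outputs {'ORGANIZATION': 'Patriots', 'PERSON': 'Brady', 'S': 'plays for the Patriots.'} for the sentence "Tom Brady plays for the Patriots."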
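
To make the point about duplicate names concrete, here is a small sketch of how Python's own `json` module treats them. The JSON text is made up for illustration. By default the last pair for a repeated name wins, and `object_pairs_hook` can be used to see every pair:

```python
import json

# Hypothetical JSON text with a repeated "team" name.
text = '{"team": "India", "team": "Australia"}'

# Default behaviour: a plain dict, so the last value for a name wins.
parsed = json.loads(text)
print(parsed)   # {'team': 'Australia'}

# object_pairs_hook receives the pairs in document order,
# so passing `list` keeps the duplicates visible.
pairs = json.loads(text, object_pairs_hook=list)
print(pairs)    # [('team', 'India'), ('team', 'Australia')]
```

Other parsers may behave differently, which is why a list of children is the safer representation for a tree with repeated labels.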