Python: problems transforming a nested json/dict into tuple format?


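Update

Consider the following. How can I extract 4-tuples holding lemma, original_form, tag and id, with id included if and only if it is present? So far, I have tried: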

def gettuples(data, level = 0):
    if isinstance(data, dict):
        if 'semtheme_list' in data:
            print(data['semtheme_list'][0])
            yield data['semtheme_list'][0]

        elif 'analysis_list' in data:
            print(data['analysis_list'][0])
            yield data['analysis_list'][0]

        for val in data.values():
            yield from gettuples(val)

    elif isinstance(data, list):
        for val in data:
            yield from gettuples(val)
With the function above I get the following (*):

This is very similar to the 4-tuples I am looking for (**):

but using the id from entity_list:

entity_list: [{
    form: "John Deere",
    official_form: "Deere & Company",
    id: "d5250a54a8",
    sementity: {
        class: "instance",
        fiction: "nonfiction",
        id: "ODENTITY_INDUSTRIAL_COMPANY",
        type: "Top>Organization>Company>IndustrialCompany"
    }
Then, when I print:

result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]

print(result)
I get this error:

  File "/Users/user/PycharmProjects/Tests/test.py", line 51, in pos_tag2
    result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]
  File "/Users/user/PycharmProjects/Tests/test.py", line 51, in <listcomp>
    result = [['lema:',obj['lemma'], 'original_form', obj['original_form'], 'tag:',obj['tag']] for obj in gettuples(json_data)]
KeyError: 'lemma'
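Output:

Then:

Output:


However, I am having trouble getting at the specific values in the list. Another possible solution is pandas... Does anyone know how to do this?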

The code below should do what you want. It is not the most elegant approach, but hopefully it is clear.

import yaml
from pprint import pprint

# The file is parsed with PyYAML, as in the original script.
# safe_load is used because yaml.load now requires an explicit Loader,
# and plain 'r' replaces the 'rU' mode that Python 3.11+ rejects.
with open('json_dict.json', 'r') as f:
    data = yaml.safe_load(f)

results = []        # one dict per analysis_list entry
sementity_map = {}  # entity id -> sementity id

def extract_analysis(l):
    # Copy the three wanted fields, plus the first sense id when there is one.
    for d in l:
        out = {
            'lemma': d['lemma'],
            'original_form': d['original_form'],
            'tag': d['tag']
        }

        if 'sense_id_list' in d:
            out['id'] = d['sense_id_list'][0]['sense_id']

        results.append(out)

def extract_entities(l):
    # Remember which sementity id each entity id stands for.
    for d in l:
        if 'sementity' in d and 'id' in d['sementity']:
            sementity_map[d['id']] = d['sementity']['id']


def find_analysis_and_entities(d):
    if not isinstance(d, dict):  # Added for non-dict values
        return

    for k, v in d.items():
        if isinstance(v, list):
            if k == 'analysis_list':
                extract_analysis(v)
            elif k == 'entity_list':
                extract_entities(v)
            else:
                for do in v:
                    find_analysis_and_entities(do)
        else:
            find_analysis_and_entities(v)

def apply_entities(e, m):
    # Second pass: swap each bare sense id for its sementity id,
    # or drop the field when the map m has no entry for it.
    for d in e:
        if 'id' in d:
            if d['id'] in m:
                d['id'] = m[d['id']]
            else:
                del d['id']

find_analysis_and_entities(data)
apply_entities(results, sementity_map)

pprint(results)
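For the semantic IDs we keep a separate mapping dict and apply it after the initial search has run. The first search builds the results with bare IDs, together with the sementity map.

Part of the problem (I think) stems from the fact that you cannot be sure whether you have already found or passed the matching sementity id before you reach the place where it has to be applied (and using dicts does not help).

Here we apply the id mappings only when they are found; otherwise we delete the id field. For example, a0a1a5401f and __121232880588840445720 are both not listed in the entity_list block, so they are removed from the results.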
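The remap-or-drop step described above can be looked at in isolation. What follows is a minimal sketch rather than the answer's exact code: `remap_ids` is a hypothetical helper name, the sample records and the `no_such_id` value are made up in the shape of the answer's `results`, and the helper returns new dicts instead of mutating in place.

```python
def remap_ids(records, id_map):
    """Return copies of records with 'id' translated through id_map.

    A record whose 'id' has no entry in id_map loses the field entirely;
    a record without an 'id' is copied unchanged.
    """
    remapped = []
    for record in records:
        new_record = dict(record)              # shallow copy, input stays untouched
        sense_id = new_record.pop('id', None)  # remove the bare id, if any
        if sense_id in id_map:
            new_record['id'] = id_map[sense_id]
        remapped.append(new_record)
    return remapped


# Illustrative data only: the first id is mapped, the second is not,
# and the third record never had an id.
records = [
    {'lemma': 'John Deere', 'original_form': 'John Deere', 'tag': 'NP-S-N-', 'id': 'd5250a54a8'},
    {'lemma': 'annual', 'original_form': 'annual', 'tag': 'AP-N5', 'id': 'no_such_id'},
    {'lemma': 'list', 'original_form': 'list', 'tag': 'NC-S-N5'},
]
id_map = {'d5250a54a8': 'ODENTITY_INDUSTRIAL_COMPANY'}

cleaned = remap_ids(records, id_map)
```

Returning copies keeps the bare ids available in `records`, which can help when checking why a particular id was dropped.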

For the sample input file, the full script prints:

[{'lemma': 'Robert Downey Jr',
  'original_form': 'Robert Downey Jr',
  'tag': 'NPUU-N-'},
 {'lemma': 'Robert Downey Jr',
  'original_form': 'Robert Downey Jr',
  'tag': 'GNUS3S--'},
 {'lemma': 'top', 'original_form': 'has topped', 'tag': 'VI-S3PPA-N-N9'},
 {'id': 'ODENTITY_MAGAZINE',
  'lemma': 'Forbes',
  'original_form': 'Forbes',
  'tag': 'NP-S-N-'},
 {'lemma': 'magazine', 'original_form': 'magazine', 'tag': 'NC-S-N5'},
 {'lemma': 'magazine', 'original_form': 'Forbes magazine', 'tag': 'GN-S3---'},
 {'lemma': "'s", 'original_form': "'s", 'tag': 'WN-'},
 {'lemma': 'annual', 'original_form': 'annual', 'tag': 'AP-N5'},
 {'lemma': 'list', 'original_form': 'list', 'tag': 'NC-S-N5'},
 {'lemma': 'list', 'original_form': 'annual list', 'tag': 'GN-S3---'},
 {'id': 'ODENTITY_INDUSTRIAL_COMPANY',
  'lemma': 'John Deere',
  'original_form': 'John Deere',
  'tag': 'NP-S-N-'},
 {'lemma': 'John Deere', 'original_form': 'John Deere', 'tag': 'GN-S3Y--'},
 {'lemma': 'John Deere',
  'original_form': 'annual list John Deere',
  'tag': 'GN-S3---'},
 {'lemma': 'John Deere',
  'original_form': "Forbes magazine's annual list John Deere",
  'tag': 'GN-S3D--'},
 {'lemma': '*',
  'original_form': "Robert Downey Jr has topped Forbes magazine's annual list "
                   'John Deere',
  'tag': 'Z-----------'}]
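Since the question asks for tuples rather than dicts, the same two-pass idea can also be written without module-level state. This is a sketch under assumptions, not the answer's method: `iter_nodes` and `extract_tuples` are hypothetical names, the sample is a trimmed fragment modelled on the `token_list` structure shown in the question's pandas output, and the order of the tuples may differ from the script above because the traversal visits a node's own `analysis_list` before its children.

```python
def iter_nodes(node):
    """Yield every dict found anywhere inside a nested dict/list structure."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)
    # strings, numbers and None are leaves: nothing to yield


def extract_tuples(data):
    """Return (lemma, original_form, tag[, sementity_id]) tuples from data."""
    # Pass 1: map each entity's own id to its sementity id.
    entity_ids = {}
    for node in iter_nodes(data):
        for entity in node.get('entity_list', []):
            sementity = entity.get('sementity', {})
            if 'id' in entity and 'id' in sementity:
                entity_ids[entity['id']] = sementity['id']

    # Pass 2: one tuple per analysis entry; the fourth field is appended
    # only when the first sense id is a known entity id.
    tuples = []
    for node in iter_nodes(data):
        for analysis in node.get('analysis_list', []):
            row = (analysis['lemma'], analysis['original_form'], analysis['tag'])
            senses = analysis.get('sense_id_list', [])
            if senses and senses[0]['sense_id'] in entity_ids:
                row += (entity_ids[senses[0]['sense_id']],)
            tuples.append(row)
    return tuples


# Trimmed sample: 'e7c6da7489' appears in entity_list, '228eaef205' does not.
sample = {
    'token_list': [{
        'form': 'Deere',
        'analysis_list': [
            {'lemma': 'Edere', 'original_form': 'Edere', 'tag': 'NPFS-N-',
             'sense_id_list': [{'sense_id': 'e7c6da7489'}]},
            {'lemma': 'deer', 'original_form': 'deer', 'tag': 'NC-S-N2',
             'sense_id_list': [{'sense_id': '228eaef205'}]},
        ],
        'topic_list': {
            'entity_list': [
                {'form': 'Edere', 'id': 'e7c6da7489',
                 'sementity': {'id': 'ODENTITY_FIRST_NAME'}},
            ],
        },
    }],
}

rows = extract_tuples(sample)
```

Running both passes over the whole structure means the entity map is complete before any id is looked up, which sidesteps the ordering problem mentioned in the answer.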

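So you want four specific keys? And they are all contained somewhere in the analysis_list values?

If you have arbitrary nesting, then not really; if you know the path to the keys you want and it never changes, then sure.

OK, getting what you want is simple, but the problem is that there are only three "sementity" entries, of which only two have an id, while there are 15 of the rest.

So your problem is querying several keys in a nested, tree-like json object that contains both dicts and lists? Can you define the tree structure of the json object you provided?

Did you get the error with the sample you provided, or with another file?

I think I understand. OK, I will update this in an hour or so... If you want to try it yourself, try building a dict map while searching, then transform the output afterwards.

@johndoe I have updated the script. It now uses a second pass to map the ids to entity ids.

@johndoe Did it work? I have tweaked it a bit to remove the missing IDs, which seems closer to your expected output.

@johndoe I suspect this is an issue with the recursive call when there is a key holding a bare string. I have updated it with an extra check (# Added for non-dict values), which should fix it.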
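import pandas as pd

# pandas >= 1.0 exposes json_normalize at the top level;
# `request` is presumably the parsed API response from the question.
df = pd.json_normalize(request, ['token_list',['token_list']])
df = pd.DataFrame(df)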
df
    affected_by_negation    analysis_list   endp    form    id  inip    quote_level     separation  style   token_list  type
0   no  [{'lemma': '*', 'tag': 'Z-----------', 'origin...   4   Deere   6   0   0   _   {'isTitle': 'no', 'isItalics': 'no', 'isUnderl...   [{'form': 'Deere', 'analysis_list': [{'lemma':...   phrase
df_clean =  df.drop(df.columns[[0, 2,4, 5, 6, 7, 8, 10]], axis=1)
df_clean
list(df_clean.itertuples(index=False))
[Pandas(analysis_list=[{'lemma': '*', 'tag': 'Z-----------', 'original_form': 'Deere'}], form='Deere', token_list=[{'form': 'Deere', 'analysis_list': [{'lemma': 'Edere', 'tag': 'GN-S3---', 'original_form': 'Deere'}, {'lemma': 'deer', 'tag': 'GN-S3---', 'original_form': 'Deere'}, {'lemma': 'Edere', 'tag': 'GN-P3---', 'original_form': 'Deere'}, {'lemma': 'deer', 'tag': 'GN-P3---', 'original_form': 'Deere'}, {'lemma': 'Edere', 'tag': 'GNFU3---', 'original_form': 'Deere'}], 'head': '1', 'separation': '_', 'affected_by_negation': 'no', 'endp': '4', 'type': 'phrase', 'style': {'isTitle': 'no', 'isItalics': 'no', 'isUnderlined': 'no', 'isBold': 'no'}, 'id': '5', 'inip': '0', 'token_list': [{'form': 'Deere', 'affected_by_negation': 'no', 'sense_list': [{'id': '228eaef205', 'info': 'sementity/class=class@fiction=nonfiction@id=ODENTITY_MAMMAL@type=Top>LivingThing>Animal>Vertebrate>Mammal\tsemld_list=sumo:Mammal\tsemtheme_list/id=ODTHEME_ZOOLOGY@type=Top>NaturalSciences>Zoology', 'form': 'deer'}, {'id': 'e7c6da7489', 'info': 'sementity/class=instance@fiction=nonfiction@id=ODENTITY_FIRST_NAME@type=Top>Person>FirstName\tsemld_list=sumo:FirstName', 'form': 'Edere'}], 'separation': '_', 'style': {'isTitle': 'no', 'isItalics': 'no', 'isUnderlined': 'no', 'isBold': 'no'}, 'id': '1', 'inip': '0', 'topic_list': {'entity_list': [{'semld_list': ['sumo:FirstName'], 'form': 'Edere', 'sementity': {'id': 'ODENTITY_FIRST_NAME', 'class': 'instance', 'fiction': 'nonfiction', 'type': 'Top>Person>FirstName'}, 'id': 'e7c6da7489'}], 'concept_list': [{'semld_list': ['sumo:Mammal'], 'form': 'deer', 'semtheme_list': [{'id': 'ODTHEME_ZOOLOGY', 'type': 'Top>NaturalSciences>Zoology'}], 'sementity': {'id': 'ODENTITY_MAMMAL', 'class': 'class', 'fiction': 'nonfiction', 'type': 'Top>LivingThing>Animal>Vertebrate>Mammal'}, 'id': '228eaef205'}]}, 'analysis_list': [{'lemma': 'Edere', 'sense_id_list': [{'sense_id': 'e7c6da7489'}], 'tag': 'NPFS-N-', 'original_form': 'Edere', 'check_info': {'form_list': 
[{'form': 'Edere'}], 'tag': '6'}}, {'lemma': 'deer', 'sense_id_list': [{'sense_id': '228eaef205'}], 'tag': 'NC-S-N2', 'original_form': 'deer', 'check_info': {'form_list': [{'form': 'deer'}], 'tag': '6'}}, {'lemma': 'deer', 'sense_id_list': [{'sense_id': '228eaef205'}], 'tag': 'NC-P-N2', 'original_form': 'deer', 'check_info': {'form_list': [{'form': 'deer'}], 'tag': '6'}}], 'quote_level': '0', 'endp': '4'}], 'quote_level': '0'}])]