Python: how do I parse only the keys I'm interested in?


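I've managed to fetch some JSON, and now I want to export it to a csv file. With my code in its current state, though, the resulting csv ends up with roughly one dict per cell. What I want instead is the value of each key I'm interested in, one per column. Each json has a lot of information I'm not actually interested in; I only want keys like cadId, cadNomeCompleto, cadProfissao and habDes. Some of them are inside other categories within each JSON, such as habDes, which sits under pt_ar_wsgode_objectos_DadosHabilitacoes, cadHabilitacoes, RegistoBiograficoList.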

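I searched some JSON documentation to see whether there is a function that takes keys as input in the way I need. So far I haven't been able to parse only the keys I want and export them so that the csv file has uniform columns. Can someone explain what I'm doing wrong and show me how to do it?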

import json
import csv
from csv import DictWriter

import requests  # needed for requests.get() below


list_json = ['a705932387657456c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657456c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoXIII_json.txt&Inline=true',
             'a705932387657456c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395953556c66616e4e76626935306548513d&fich=RegistoBiograficoXII_json.txt&Inline=true',
             'a705932387657456b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a6231684a5832707a6232347564486830&fich=RegistoBiograficoXI_json.txt&Inline=true',
             'a7059323876574355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657463971633239754c6e523464413d3d&fich=RegistoBiograficoX_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395753556c66616e4e76626935306548513d&fich=RegistoBiograficoVII_json.txt&Inline=true',
             'a7059323876566b6b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62315a4a5832707a6232347564486830&fich=RegistoBiograficoVI_json.txt&Inline=true',
             'a7059323876566955794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566c3971633239754c6e523464413d3d&fich=RegistoBiograficoV_json.txt&Inline=true',
             'a70593238765356596c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c575832707a6232347564486830&fich=RegistoBiograficoIV_json.txt&Inline=true',
             'a705932387653556c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932394a53556c66616e4e76626935306548513d&fich=RegistoBiograficoIII_json.txt&Inline=true',
             'a705932387653556b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c4a5832707a6232347564486830&fich=RegistoBiograficoII_json.txt&Inline=true',
             'a7059323876513239756333527064485670626e526c4c314a6c5a326c7a644739436157396e636d466d61574e765132397563313971633239754c6e523464413d3d&fich=RegistoBiograficoCons_json.txt&Inline=true']


result = []

for i in list_json:
    url = 'http://app.parlamento.pt/webutils/docs/doc.txt?path=6148523063446f764c324679626d56304c3239775a57356b595852684c3052685a47397a51574a6c636e5276637939535a576470633352764a544977516d6c765a334c446f575{}'.format(i)
    r = requests.get(url)
    cont = r.json()
    result.append(cont)


with open('bio.csv', 'w', newline='', encoding='utf-8-sig') as outfile:
    writer = DictWriter(outfile, ('?xml', 'RegistoBiografico'))
    writer.writerows(result)


You can iterate over the child entries to extract the data, for example:

result = []
for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in ['cadId', 'cadNomeCompleto', 'cadProfissao']:
        try:
            tmp_row.append(child[k])
        except KeyError:
            print(f"  missing {k} for {child['cadId']}")
            # insert None for missing value so columns still match
            tmp_row.append(None)
    result.append(tmp_row)
Running this shows that some entries do not have all of the data:

  missing cadProfissao for 5950
  missing cadProfissao for 6063
  missing cadProfissao for 6121
  missing cadProfissao for 5534
  missing cadProfissao for 695
  missing cadProfissao for 5952
  missing cadProfissao for 4104
  missing cadProfissao for 4389
  missing cadProfissao for 2445
>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter']
>>>
To add nested keys, you could insert tmp_row.append(child['a']['b']['c']), but you would also need to replicate the handling of missing values.
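
One way to avoid repeating the try/except for every nested key is a small helper that walks a path of keys and falls back to a default. This is only a sketch: get_nested and the sample record are made up for illustration, and it assumes every level along the path is a dict (if a level is a list in the real data, it simply returns the default).

```python
def get_nested(record, path, default=None):
    """Follow `path` (a sequence of keys) into nested dicts.

    Returns `default` as soon as a step is missing or is not a dict.
    """
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record, shaped like the entries discussed above
child = {
    'cadId': '5854',
    'cadHabilitacoes': {
        'pt_ar_wsgode_objectos_DadosHabilitacoes': {'habDes': 'Ciência Política'}
    },
}

# One path per output column; cadProfissao is deliberately absent
columns = [
    ('cadId',),
    ('cadProfissao',),
    ('cadHabilitacoes', 'pt_ar_wsgode_objectos_DadosHabilitacoes', 'habDes'),
]

tmp_row = [get_nested(child, path) for path in columns]
print(tmp_row)
```

The missing cadProfissao becomes None, so the columns still line up, just as in the loop above.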

With the jsonpointer module, you can specify the path to the value you are looking for:

from jsonpointer import resolve_pointer as j_get
result = []
search_dict = {
  'Id': '/cadId',
  'NomeCompleto': '/cadNomeCompleto',
  'Profissao':'/cadProfissao',
  'habDes':'/cadHabilitacoes/pt_ar_wsgode_objectos_DadosHabilitacoes/habDes',
}

for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in search_dict.keys():
        tmp_row.append(j_get(child, search_dict[k], None))
    result.append(tmp_row)
I've removed the KeyError exception handling because I pass a default value of None to the resolve_pointer function. The result now contains:

>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter', 'Ciência Política']
If you are interested in which rows, or how many rows, are incomplete, you can use a list comprehension:

>>> len([x for x in result if None in x])
165

But it is easier to see in the csv output.
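
For the csv step itself, a minimal sketch with the standard csv module might look like the following. The rows here are made-up placeholders in the same shape as result, write_rows is a hypothetical helper, and the header is assumed to be the keys of search_dict. An in-memory buffer stands in for the file so the snippet runs on its own.

```python
import csv
import io

# Column names, assumed to match the keys of search_dict
header = ['Id', 'NomeCompleto', 'Profissao', 'habDes']

# Made-up rows in the same shape as `result`
result = [
    ['1', 'EXAMPLE NAME ONE', 'teacher', 'Law'],
    ['2', 'EXAMPLE NAME TWO', None, None],
]


def write_rows(outfile, header, rows):
    """Write one header line, then one csv line per row."""
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)  # None is written as an empty cell


buffer = io.StringIO(newline='')
write_rows(buffer, header, result)
csv_text = buffer.getvalue()
print(csv_text)

# Ids of the rows that have at least one missing value
incomplete_ids = [row[0] for row in result if None in row]
print(incomplete_ids)
```

To write a real file, the buffer could be swapped for open('bio.csv', 'w', newline='', encoding='utf-8-sig'), as in the question.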


You can specify the elements in the json response manually: cont[key1], cont[key2][subkey]. If you can't be sure that all of the data is present, you may need to handle the KeyError exception. If the json hierarchy is complex, you can use a helper library to extract the elements you need.

Hi, thanks for your answer. I did as you suggested, cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb'], and now I have a similar problem: within each JSON sub-key of interest I have 311 dicts, from which I want to pull the specific keys I mentioned, subject to the KeyError exception. I want to do this for every JSON. Any suggestions on how to loop through the 311 dicts and look up the key values I need?

That helped a lot! Thank you very much :)