Python: How do I parse only the keys I'm interested in?

python json csv web-scraping

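I've managed to export some JSON, and now I want to export it to a csv file. However, with my code in its current state, the resulting csv has roughly one dict per cell. What I want instead is the value of each key I'm interested in, one per column. Each JSON contains a lot of information I'm not actually interested in; I only want keys like cadId, cadNomeCompleto, cadProfissao and habDes. Some of these are nested inside other categories of each JSON: habDes, for example, sits inside pt_ar_wsgode_objectos_DadosHabilitacoes, inside cadHabilitacoes, inside RegistoBiograficoList.

I searched the JSON documentation for a function that takes keys as input in the way I need. So far I haven't been able to parse only the keys I want and export them so that the csv file ends up with uniform columns. Can someone explain what I'm doing wrong and show me how to do this?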

import json
import csv
from csv import DictWriter

import requests  # third-party; required for requests.get() below


list_json = ['a705932387657456c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657456c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoXIII_json.txt&Inline=true',
             'a705932387657456c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395953556c66616e4e76626935306548513d&fich=RegistoBiograficoXII_json.txt&Inline=true',
             'a705932387657456b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a6231684a5832707a6232347564486830&fich=RegistoBiograficoXI_json.txt&Inline=true',
             'a7059323876574355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657463971633239754c6e523464413d3d&fich=RegistoBiograficoX_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395753556c66616e4e76626935306548513d&fich=RegistoBiograficoVII_json.txt&Inline=true',
             'a7059323876566b6b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62315a4a5832707a6232347564486830&fich=RegistoBiograficoVI_json.txt&Inline=true',
             'a7059323876566955794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566c3971633239754c6e523464413d3d&fich=RegistoBiograficoV_json.txt&Inline=true',
             'a70593238765356596c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c575832707a6232347564486830&fich=RegistoBiograficoIV_json.txt&Inline=true',
             'a705932387653556c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932394a53556c66616e4e76626935306548513d&fich=RegistoBiograficoIII_json.txt&Inline=true',
             'a705932387653556b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c4a5832707a6232347564486830&fich=RegistoBiograficoII_json.txt&Inline=true',
             'a7059323876513239756333527064485670626e526c4c314a6c5a326c7a644739436157396e636d466d61574e765132397563313971633239754c6e523464413d3d&fich=RegistoBiograficoCons_json.txt&Inline=true']


result = []

for i in list_json:
    url = 'http://app.parlamento.pt/webutils/docs/doc.txt?path=6148523063446f764c324679626d56304c3239775a57356b595852684c3052685a47397a51574a6c636e5276637939535a576470633352764a544977516d6c765a334c446f575{}'.format(i)
    r = requests.get(url)
    cont = r.json()
    result.append(cont)


with open('bio.csv', 'w', newline='', encoding='utf-8-sig') as outfile:
    writer = DictWriter(outfile, ('?xml', 'RegistoBiografico'))
    writer.writerows(result)

You can iterate over the child entries and pull out the keys you need, for example:

result = []
for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in ['cadId', 'cadNomeCompleto', 'cadProfissao']:
        try:
            tmp_row.append(child[k])
        except KeyError:
            print(f"  missing {k} for {child['cadId']}")
            # insert None for missing value so columns still match
            tmp_row.append(None)
    result.append(tmp_row)

Running this shows that some entries do not have all of the data:

  missing cadProfissao for 5950
  missing cadProfissao for 6063
  missing cadProfissao for 6121
  missing cadProfissao for 5534
  missing cadProfissao for 695
  missing cadProfissao for 5952
  missing cadProfissao for 4104
  missing cadProfissao for 4389
  missing cadProfissao for 2445
>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter']
>>>

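To add nested keys, you could insert

tmp_row.append(child['a']['b']['c'])

into the loop, but you would then also have to duplicate the handling of missing values for each nested lookup.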
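One way to avoid repeating the try/except for every nested key is a small helper that walks a list of keys and returns a default as soon as one is missing. This is only a minimal standard-library sketch: get_nested and the sample record are hypothetical and are not part of the original data or of any library.

```python
def get_nested(record, path, default=None):
    """Follow `path` (a sequence of keys) into nested dicts.

    Returns `default` as soon as a step is missing or is not a dict.
    """
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record, shaped like the entries discussed above.
sample = {
    'cadId': '1',
    'cadHabilitacoes': {
        'pt_ar_wsgode_objectos_DadosHabilitacoes': {'habDes': 'Direito'},
    },
}

hab_path = ['cadHabilitacoes', 'pt_ar_wsgode_objectos_DadosHabilitacoes', 'habDes']
found = get_nested(sample, hab_path)              # nested value is present
missing = get_nested({'cadId': '2'}, hab_path)    # falls back to the default
```

Inside the loop this would read tmp_row.append(get_nested(child, ['a', 'b', 'c'])), so a missing value still yields None and the columns stay aligned.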

Using the jsonpointer module, you can specify the path to each variable you are looking for:

from jsonpointer import resolve_pointer as j_get
result = []
search_dict = {
  'Id': '/cadId',
  'NomeCompleto': '/cadNomeCompleto',
  'Profissao':'/cadProfissao',
  'habDes':'/cadHabilitacoes/pt_ar_wsgode_objectos_DadosHabilitacoes/habDes',
}

for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in search_dict.keys():
        tmp_row.append(j_get(child, search_dict[k], None))
    result.append(tmp_row)

I have removed the KeyError exception handling because I pass a default value of None to the resolve_pointer function. Now, result contains:

>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter', 'Ciência Política']

If you are interested in which rows, or how many rows, are incomplete, you can use a list comprehension:

>>> len([x for x in result if None in x])
165

But this is easier to see in the csv output.
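Writing the collected rows out could look roughly like the following. It is a sketch under assumptions: result is taken to be a list of lists as built above, while the rows_to_csv helper, the header names and the sample rows are made up for illustration. It writes to an in-memory buffer so that it runs on its own.

```python
import csv
import io


def rows_to_csv(header, rows, outfile):
    """Write a header line followed by the data rows to an open text file."""
    writer = csv.writer(outfile)
    writer.writerow(header)
    # The csv module writes None as an empty cell, so incomplete rows
    # keep the same number of columns as complete ones.
    writer.writerows(rows)


# Hypothetical rows in the same shape as `result` above.
header = ['Id', 'NomeCompleto', 'Profissao', 'habDes']
sample_rows = [
    ['1', 'NOME UM', 'professora', 'Direito'],
    ['2', 'NOME DOIS', None, None],
]

buffer = io.StringIO(newline='')
rows_to_csv(header, sample_rows, buffer)
csv_text = buffer.getvalue()
```

With the real data, the header could be list(search_dict) and the target a file opened as in the question, i.e. open('bio.csv', 'w', newline='', encoding='utf-8-sig').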