Python: converting XML into a JSON structure that BigQuery can load


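I'm learning Python on the job and need help improving my solution.

I need to load XML data into BigQuery.

I have it working, but I'm not sure I've done it in a sensible way.

I call an API that returns an XML structure. I parse the XML with ElementTree and use tree.iter() to return the tags and text from the XML. I print the tags and text with: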

for node in tree.iter():
    print(f'{node.tag}, {node.text}')
This returns:

Tag              Text
Responses        None
Response         None
ResponseId       393
ResponseText     Please respond “Has this loaded” 
ResponseType     single
ResponseStatus   0
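The Responses tag appears only once per API call, but Response through ResponseStatus is a repeating group, and ResponseId is the key for each group. Each call returns fewer than 100 repeating groups.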
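To make the "repeating group keyed by ResponseId" idea concrete, here is a minimal sketch that walks each Response element as a unit instead of flattening the whole tree. The shortened `SAMPLE_XML` payload and the `groups_by_id` helper are made up for illustration; only the tag names come from the output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened payload shaped like the API output described above.
SAMPLE_XML = (
    "<Responses>"
    "<Response><ResponseId>393</ResponseId>"
    "<ResponseText>Please respond</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "<Response><ResponseId>394</ResponseId>"
    "<ResponseText>Please confirm</ResponseText>"
    "<ResponseType>single</ResponseType>"
    "<ResponseStatus>0</ResponseStatus></Response>"
    "</Responses>"
)


def groups_by_id(xml_text):
    """Return {ResponseId: {child tag: child text}} for every <Response>."""
    root = ET.fromstring(xml_text)
    groups = {}
    # iter("Response") matches only elements whose tag is exactly "Response",
    # so the outer <Responses> wrapper is skipped.
    for response in root.iter("Response"):
        fields = {child.tag: child.text for child in response}
        groups[fields["ResponseId"]] = fields
    return groups


groups = groups_by_id(SAMPLE_XML)
print(len(groups))
print(groups["394"]["ResponseText"])
```

Because each Response is handled as one element, the grouping falls out of the XML structure itself and no separate regrouping pass is needed.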

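A key, Response_key, is returned in the header; it is the parent of all the ResponseIds. My goal is to take this data, convert it to JSON and stream it into BigQuery.

The table structure I need is:

ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus

The approach I used is: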

  • Loop with tree.iter() and build a list:

    node_list = []
    for node in tree.iter():
        node_list.append(node.tag)
        node_list.append(node.text)

  • Group the list with itertools (I found this step difficult)

  • This returns:

    [['Responses', 'None'], ['None', 'ResponseId', '393', 'ResponseText', 'Please respond “Has this loaded” "', 'ResponseType', 'single', 'ResponseStatus', '0'], ['None', 'ResponseId', '394', 'ResponseText', 'Please confirm “Connection made” "', 'ResponseType', 'single', 'ResponseStatus', '0']]
    
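  • Load it into a Pandas DataFrame and strip any double quotes, in case they cause problems in BigQuery
  • Add ResponseKey as a column to the DataFrame
  • Convert the DataFrame to JSON and pass it to load_table_from_json
  • It works, but I'm not sure it's sensible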
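On the quote-stripping step: a small sketch suggesting it may not be needed, since `json.dumps` escapes embedded double quotes and the text survives a round trip intact. The `row` dict is a made-up example, and this assumes the BigQuery JSON loader accepts standard JSON escapes, which is worth confirming against your own table.

```python
import json

# Made-up row whose text contains double quotes, like the API responses.
row = {
    "ResponseId": "393",
    "ResponseText": 'Please respond "Has this loaded"',
}

# json.dumps escapes the inner quotes as \" so the line stays valid JSON.
line = json.dumps(row)
print(line)

# Decoding gives back the original text, quotes included.
restored = json.loads(line)
print(restored["ResponseText"])
```

If that holds for your data, the original wording can be kept in BigQuery rather than altered before loading.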

Any suggestions for improvement would be appreciated.

Here is a sample of the XML:

    {"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your  datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your  datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your  datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
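Note that the sample is a JSON object whose "GetResponses" value is the XML as a string, with closing tags written as `<\/...>`. A minimal sketch of unwrapping it, assuming the API body always has that shape; the shortened `body` string is made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, shortened body in the same shape as the sample above.
body = (
    r'{"GetResponses":"<Responses><Response>'
    r'<ResponseId>393938<\/ResponseId>'
    r'<ResponseType>single<\/ResponseType>'
    r'<\/Response><\/Responses>"}'
)

# json.loads decodes the "\/" escape to "/", leaving plain XML text.
xml_text = json.loads(body)["GetResponses"]

# fromstring parses a string directly, so no temporary file is needed.
root = ET.fromstring(xml_text)
print(root.tag)
print(root.find("Response/ResponseId").text)
```

Decoding with `json.loads` first avoids hand-editing the backslashes out of the payload.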
    

I'm not sure what the desired output is; here is one approach:

    import xml.etree.ElementTree as ET
    import json
    
    # Parse the XML file
    p = r"d:\tmp.xml"
    tree = ET.parse(p)
    root = tree.getroot()
    
    # Top-level dict: the root tag plus a list with one dict per <Response>
    json_dict = {}
    json_dict[root.tag] = root.text
    json_dict['response_list'] = []
    
    for node in root:
        # Map each child tag of this <Response> to its text
        tmp_dict = {}
        for response_info in node:
            tmp_dict[response_info.tag] = response_info.text
        json_dict['response_list'].append(tmp_dict)
    
    # Write the result out as JSON
    with open(r'd:\out.json', 'w') as of:
        json.dump(json_dict, of)
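Building on the same idea, here is a rough sketch of going straight to one flat row per Response with the parent key attached, which is closer to the table layout in the question. `build_rows`, `load_rows`, the `"abc123"` key and the shortened XML are all made up for illustration. The load step assumes the google-cloud-bigquery package, valid credentials and an existing table with matching columns, so it is defined but not called here.

```python
import json
import xml.etree.ElementTree as ET


def build_rows(xml_text, response_key):
    """Flatten each <Response> into one dict and attach the parent key."""
    root = ET.fromstring(xml_text)
    rows = []
    for response in root.iter("Response"):
        row = {"ResponseKey": response_key}
        row.update({child.tag: child.text for child in response})
        rows.append(row)
    return rows


def load_rows(rows, table_id):
    """Assumed load step; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily: optional dependency

    client = bigquery.Client()
    job = client.load_table_from_json(rows, table_id)
    job.result()  # wait for the load job to finish


xml_text = (
    "<Responses><Response><ResponseId>393</ResponseId>"
    "<ResponseStatus>0</ResponseStatus></Response></Responses>"
)
rows = build_rows(xml_text, response_key="abc123")  # "abc123" is a made-up key
print(json.dumps(rows))
```

All values come out as strings because ElementTree returns element text as `str`; if the table uses integer columns, they would need converting before the load.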
    

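Can you add a sample XML input and the JSON output? It looks like you could iterate over the XML and copy it into JSON without all the intermediate steps.

Hi @trigonom, thanks for taking a look. I've added the XML and JSON.

Thanks for your help, this is a much more elegant solution. It gives me the right data structure for streaming into BigQuery.
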
    node_list = []
    for node in tree.iter():
        node_list.append(node.tag)
        node_list.append(node.text)
    
    json_format = json.dumps(node_list)
    print(json_format)
    
    
    ["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement:\"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]
    