Creating a DataFrame from XML with missing data in Python

python xml parsing dataframe tree

I am very new to Python (less than a week), and I want to read three "variables" into a DataFrame: PMID, Abstract Text, and Mesh. My .xml file is 10 GB.

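Right now, the code below produces a list of PMIDs and Abstract Texts. How can I turn this into a DataFrame with three variables, PMID, Abstract Text, and Mesh, where the DescriptorName values from the XML are joined with commas in Mesh (for example: Adenocarcinoma, Antineoplastic Agents, Colorectal Neoplasms)? Note that the XML snippet below covers only one PMID. There are about 1.8 million in total.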

Note also that some PMIDs do not have any Abstract Text or Mesh. In that case I would like NA or "" in their row.

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# read in all PMIDs and Abstract Texts - got too scared to parse in Mesh incorrectly since it's very time consuming to re-run
pmid_abstract = []
for event, element in etree.iterparse("pubmed_result.xml"):
    if element.tag in ["PMID", "AbstractText"]:
        pmid_abstract.append(element.text)
    element.clear()

Here is the relevant markup from the .xml for a single PMID:

<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">29310420</PMID>       
        <Article PubModel="Print">
            <Abstract>
                <AbstractText Label="RATIONALE" NlmCategory="BACKGROUND">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC). However, the reported 1-year overall survival rate does not exceed 25%.</AbstractText>
                <AbstractText Label="PATIENT CONCERNS" NlmCategory="UNASSIGNED">A 55-year-old man affected by mCRC, treated with regorafenib combined with stereotactic body radiotherapy (SBRT), showing a durable response.</AbstractText>
            </Abstract>
        </Article>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D000970" MajorTopicYN="N">Antineoplastic Agents</DescriptorName>
                <QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015179" MajorTopicYN="N">Colorectal Neoplasms</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
</PubmedArticle>



This is probably what you want.

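I appended a copy of the complete PubmedArticle element to the end of the xml file, then wrapped the two elements in a single PubmedArticles (plural) element, to demonstrate the principles involved in processing such a file. Because your file is so large, I chose to put the interim results into an SQL database table and then import them into pandas.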
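The two-article test file described above can be reproduced roughly as follows. This is only a sketch: the article markup is abbreviated from the question's snippet, and the second PMID, 29310425, is taken from the printed result at the end of this answer.

```python
import copy
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the single-article snippet from the question.
single = """<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">29310420</PMID>
        <Article PubModel="Print">
            <Abstract>
                <AbstractText Label="RATIONALE">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC).</AbstractText>
            </Abstract>
        </Article>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
</PubmedArticle>"""

first = ET.fromstring(single)

# Copy the whole article and give the copy its own PMID.
second = copy.deepcopy(first)
second.find("MedlineCitation/PMID").text = "29310425"

# Wrap both articles in one plural root element.
root = ET.Element("PubmedArticles")
root.extend([first, second])

xml_bytes = ET.tostring(root)
pmids = [p.text for p in root.iter("PMID")]
print(pmids)
```

Writing the result to disk with ET.ElementTree(root).write("ragtime.xml") should give a file the code in this answer can read.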

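The first time through the loop there is no record to process. After that, each time a PMID element is encountered, it means the preceding PubmedArticle has been completely processed and is ready to be stored in the database. When any of the other elements of interest is encountered, it is simply added to the dictionary that represents the current article.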

from xml.etree import ElementTree
import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS ragtime')
c.execute('''CREATE TABLE ragtime (PMID text, AbstractText Text, DescriptorName Text)''')

def store(record):
    # Join the collected pieces; an article with no abstract or no MeSH terms gets ''.
    c.execute('INSERT INTO ragtime VALUES (?, ?, ?)',
              [record['PMID'], ' '.join(record['AbstractText']), ','.join(record['DescriptorName'])])

record = None
for ev, el in ElementTree.iterparse('ragtime.xml'):
    if el.tag == 'PMID':
        # A new PMID means the previous article is complete, so store it.
        if record:
            store(record)
        record = {'PMID': el.text, 'AbstractText': [], 'DescriptorName': []}
    elif el.tag == 'AbstractText':
        # el.text can be None (for example, an empty element), which join() would reject.
        record['AbstractText'].append(el.text or '')
    elif el.tag == 'DescriptorName':
        record['DescriptorName'].append(el.text or '')

# Store the last article; no later PMID element triggers it.
if record:
    store(record)

conn.commit()

df = pd.read_sql_query('SELECT * FROM ragtime', conn)
print(df.head())

conn.close()

It prints the following result:

       PMID                                       AbstractText  \
0  29310420  Regorafenib is the new standard third-line the...   
1  29310425  Regorafenib is the new standard third-line the...   

                                      DescriptorName  
0  Adenocarcinoma,Antineoplastic Agents,Colorecta...  
1  Adenocarcinoma,Antineoplastic Agents,Colorecta...
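
For a file as large as the one in the question, one possible variation (a sketch under assumptions, not part of the code above) is to handle each PubmedArticle when its closing tag arrives, read the PMID from the MedlineCitation/PMID path, and clear the element afterwards to release its subtree. Articles with no abstract or no MeSH terms get None, which pandas treats as missing. The function name iter_articles, the abbreviated sample, and the second PMID 00000001 are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

def iter_articles(source):
    """Yield one (pmid, abstract, mesh) tuple per PubmedArticle element."""
    for _event, el in ET.iterparse(source, events=("end",)):
        if el.tag != "PubmedArticle":
            continue
        pmid = el.findtext("MedlineCitation/PMID")
        # itertext() also collects text nested inside inline child markup.
        abstract = " ".join(
            "".join(a.itertext()).strip() for a in el.iter("AbstractText")
        )
        mesh = ",".join(d.text or "" for d in el.iter("DescriptorName"))
        # Empty strings become None so that pandas reports them as missing.
        yield pmid, abstract or None, mesh or None
        el.clear()  # drop the finished article's children

# Made-up two-article sample; the second article has only a PMID.
sample = b"""<PubmedArticles>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29310420</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="RATIONALE">First part.</AbstractText>
          <AbstractText Label="PATIENT CONCERNS">Second part.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Adenocarcinoma</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Antineoplastic Agents</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000001</PMID>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticles>"""

df = pd.DataFrame(
    list(iter_articles(io.BytesIO(sample))),
    columns=["PMID", "AbstractText", "Mesh"],
)
print(df)
```

Passing a file name such as 'pubmed_result.xml' instead of the BytesIO object should work the same way. Note that the cleared, empty PubmedArticle elements stay attached to the root, so memory use still grows slowly with the number of articles, and for 1.8 million rows it may still be preferable to stage the tuples in a database table as shown above rather than in a list.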