Creating a dataframe from XML with missing data in Python

I'm very new to Python (<1 week), and I'd like to read three "variables" into a data frame: PMID, Abstract Text, and Mesh. My .xml file is 10 GB.

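Right now, the code below produces a single list of PMIDs and abstract texts. How can I turn it into a data frame with three variables, PMID, Abstract Text, and Mesh, where Mesh holds every DescriptorName from the XML separated by commas (for example: Adenocarcinoma, Antineoplastic Agents, Colorectal Neoplasms)? Note that the XML snippet below covers only 1 PMID. There are about 1.8 million in total.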

Also note that some PMIDs have no abstract text or no MeSH terms. In that case I'd like NA or "" in that row.

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# read in all PMIDs and Abstract Texts - got too scared to parse in Mesh incorrectly since it's very time consuming to re-run
pmid_abstract = []
for event, element in etree.iterparse("pubmed_result.xml"):
    if element.tag in ["PMID", "AbstractText"]:
        pmid_abstract.append(element.text)
    element.clear()
Here are the relevant tags from the .xml, for a single PMID:

<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">29310420</PMID>       
        <Article PubModel="Print">
            <Abstract>
                <AbstractText Label="RATIONALE" NlmCategory="BACKGROUND">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC). However, the reported 1-year overall survival rate does not exceed 25%.</AbstractText>
                <AbstractText Label="PATIENT CONCERNS" NlmCategory="UNASSIGNED">A 55-year-old man affected by mCRC, treated with regorafenib combined with stereotactic body radiotherapy (SBRT), showing a durable response.</AbstractText>
            </Abstract>
        </Article>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D000970" MajorTopicYN="N">Antineoplastic Agents</DescriptorName>
                <QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015179" MajorTopicYN="N">Colorectal Neoplasms</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
</PubmedArticle>

This might be what you want.

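I appended a complete copy of the PubmedArticle element to the end of the xml file, then enclosed the two elements in a single PubmedArticles (plural) element, to demonstrate the principles involved in processing such a file. Because your file is so large, I chose to put the interim results into an sql database table and then import them into pandas.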
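
For readers who want to reproduce the two-article test input, the wrapping step might be sketched as follows. The shortened article text and the helper name wrap_articles are illustrative assumptions, not part of the answer; the second PMID, 29310425, is taken from the printed output at the end.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the single PubmedArticle in the question (assumption).
ARTICLE = """<PubmedArticle><MedlineCitation>
<PMID Version="1">29310420</PMID>
<Article><Abstract><AbstractText>Regorafenib is the new standard.</AbstractText></Abstract></Article>
<MeshHeadingList><MeshHeading><DescriptorName>Adenocarcinoma</DescriptorName></MeshHeading></MeshHeadingList>
</MedlineCitation></PubmedArticle>"""

def wrap_articles(article_xml, second_pmid):
    """Return a PubmedArticles root holding the article plus a copy with a new PMID."""
    root = ET.Element('PubmedArticles')
    first = ET.fromstring(article_xml)
    second = ET.fromstring(article_xml)
    # Give the copy its own PMID so the two rows can be told apart.
    second.find('./MedlineCitation/PMID').text = second_pmid
    root.extend([first, second])
    return root

root = wrap_articles(ARTICLE, '29310425')
xml_text = ET.tostring(root, encoding='unicode')
```

Writing xml_text to a file named ragtime.xml would give a small input with the same shape as the one described here.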

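The first time through the loop there is no record to process. After that, each time a PMID element is encountered, it means the preceding PubmedArticle has been completely processed and is ready to be stored in the database. When any of the other elements of interest is encountered, its text is simply added to the dictionary that represents the current article.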
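
The PMID-triggered flush described above assumes each article contains exactly one PMID element. If a file holds further PMID elements nested inside an article, one possible alternative is to wait for the end of each PubmedArticle element and read the three fields from it. This is a sketch under that assumption, not the answer's method; SAMPLE and iter_records are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in input: one full article and one with no abstract or MeSH terms.
SAMPLE = """<PubmedArticles>
<PubmedArticle><MedlineCitation>
<PMID>1</PMID>
<Article><Abstract><AbstractText>First part.</AbstractText><AbstractText>Second part.</AbstractText></Abstract></Article>
<MeshHeadingList>
<MeshHeading><DescriptorName>Adenocarcinoma</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName>Colorectal Neoplasms</DescriptorName></MeshHeading>
</MeshHeadingList>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>
</PubmedArticles>"""

def iter_records(source):
    """Yield one (pmid, abstract, mesh) tuple per PubmedArticle element."""
    for event, el in ET.iterparse(source, events=('end',)):
        if el.tag != 'PubmedArticle':
            continue
        # Only the PMID directly under MedlineCitation identifies the article.
        pmid = el.findtext('./MedlineCitation/PMID', default='')
        abstract = ' '.join(t.text or '' for t in el.iter('AbstractText'))
        mesh = ','.join(d.text or '' for d in el.iter('DescriptorName'))
        yield pmid, abstract, mesh
        el.clear()  # drop the finished article's children to limit memory use

records = list(iter_records(io.BytesIO(SAMPLE.encode('utf-8'))))
```

Articles without an abstract or MeSH headings come out with empty strings, which matches the "" option the question asks for.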

from xml.etree import ElementTree
import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS ragtime')
c.execute('''CREATE TABLE ragtime (PMID text, AbstractText Text, DescriptorName Text)''')

record = None
for ev, el in ElementTree.iterparse('ragtime.xml'):
    if el.tag == 'PMID':
        # A new PMID means the previous article is complete, so store it.
        if record:
            c.execute('INSERT INTO ragtime VALUES (?, ?, ?)', [record['PMID'], ' '.join(record['AbstractText']), ','.join(record['DescriptorName'])])
        record = {'PMID': el.text, 'AbstractText': [], 'DescriptorName': []}
    elif el.tag == 'AbstractText':
        # el.text is None for an empty element; store '' so join() works.
        record['AbstractText'].append(el.text or '')
    elif el.tag == 'DescriptorName':
        record['DescriptorName'].append(el.text or '')
    el.clear()  # release the element's contents; this matters for a 10 GB file
# Store the final article, which no later PMID flushes.
if record:
    c.execute('INSERT INTO ragtime VALUES (?, ?, ?)', [record['PMID'], ' '.join(record['AbstractText']), ','.join(record['DescriptorName'])])

conn.commit()

df = pd.read_sql_query('SELECT * FROM ragtime', conn)
print(df.head())

conn.close()
It produces the following printed output:

       PMID                                       AbstractText  \
0  29310420  Regorafenib is the new standard third-line the...   
1  29310425  Regorafenib is the new standard third-line the...   

                                      DescriptorName  
0  Adenocarcinoma,Antineoplastic Agents,Colorecta...  
1  Adenocarcinoma,Antineoplastic Agents,Colorecta...
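
Articles with no abstract or no MeSH terms end up as empty strings in the table. If NA is preferred over "", one way to convert them after loading might look like this. The small frame below is an illustrative stand-in for the one read back from the database, and its values are assumptions.

```python
import pandas as pd

# Stand-in for the frame read back from the database (values are illustrative).
df = pd.DataFrame({
    'PMID': ['29310420', '29310425'],
    'AbstractText': ['Regorafenib is the new standard.', ''],
    'DescriptorName': ['Adenocarcinoma,Antineoplastic Agents', ''],
})

# mask() replaces cells where the condition is True with a missing value.
cols = ['AbstractText', 'DescriptorName']
df[cols] = df[cols].mask(df[cols] == '')
```

After this, df.isna() marks the rows that had no abstract or no descriptors, and df.dropna() or df.fillna() can be used on them as usual.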