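Creating a DataFrame with missing data from XML in Python

I'm very new to Python (<1 week) and I'd like to read three "variables" into a DataFrame: PMID, AbstractText, and Mesh. My .xml is 10 GB.

Right now, the code below produces a list of PMIDs and AbstractTexts. How can I turn this into a DataFrame with three variables, PMID, AbstractText, and Mesh, where the DescriptorNames that make up Mesh in the XML are separated by commas (for example: Adenocarcinoma, Antineoplastic Agents, Colorectal Neoplasms)? Note that the XML snippet below covers just 1 PMID. There are about 1.8 million in total.

Also note that some PMIDs don't contain any AbstractText or Mesh. In that case, I'd like NA or "" to stand in for them in that row.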
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as etree

# read in all PMIDs and Abstract Texts - got too scared to parse in Mesh incorrectly since it's very time consuming to re-run
pmid_abstract = []
for event, element in etree.iterparse("pubmed_result.xml"):
    if element.tag in ["PMID", "AbstractText"]:
        pmid_abstract.append(element.text)
    element.clear()
Here are the relevant tags from the .xml for a single PMID:
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">29310420</PMID>
        <Article PubModel="Print">
            <Abstract>
                <AbstractText Label="RATIONALE" NlmCategory="BACKGROUND">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC). However, the reported 1-year overall survival rate does not exceed 25%.</AbstractText>
                <AbstractText Label="PATIENT CONCERNS" NlmCategory="UNASSIGNED">A 55-year-old man affected by mCRC, treated with regorafenib combined with stereotactic body radiotherapy (SBRT), showing a durable response.</AbstractText>
            </Abstract>
        </Article>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D000970" MajorTopicYN="N">Antineoplastic Agents</DescriptorName>
                <QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015179" MajorTopicYN="N">Colorectal Neoplasms</DescriptorName>
                <QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
                <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
</PubmedArticle>
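This may be what you want.

I appended a copy of the complete PubmedArticle element to the end of the xml file, then enclosed the two elements in a single PubmedArticles (plural) element, to demonstrate the principles involved in processing such a file. Because your file is so large, I chose to put the interim results into an SQL database table and then import that into pandas.

The first time through the loop there is no record to process. After that, each time a PMID element is encountered, it means the preceding PubmedArticle has been completely processed and is ready to be stored in the database. When any of the other elements of interest is encountered, it is simply added to the dictionary that represents the current article.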
from xml.etree import ElementTree
import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS ragtime')
c.execute('''CREATE TABLE ragtime (PMID text, AbstractText Text, DescriptorName Text)''')

def store(record):
    # Join the collected pieces; an article with no abstract or no descriptors gets ''.
    c.execute('INSERT INTO ragtime VALUES (?, ?, ?)',
              [record['PMID'], ' '.join(record['AbstractText']), ','.join(record['DescriptorName'])])

record = None
for ev, el in ElementTree.iterparse('ragtime.xml'):
    if el.tag == 'PMID':
        # A new PMID means the previous article is complete, so store it.
        if record:
            store(record)
        record = {'PMID': el.text, 'AbstractText': [], 'DescriptorName': []}
    elif el.tag == 'AbstractText':
        record['AbstractText'].append(el.text or '')  # el.text is None for an empty element
    elif el.tag == 'DescriptorName':
        record['DescriptorName'].append(el.text or '')
    el.clear()  # release the element's contents; this matters for a 10 GB file

# Store the last article, which no later PMID triggers.
if record:
    store(record)
conn.commit()

df = pd.read_sql_query('SELECT * FROM ragtime', conn)
print(df.head())
conn.close()
It produces the following printed output:
PMID AbstractText \
0 29310420 Regorafenib is the new standard third-line the...
1 29310425 Regorafenib is the new standard third-line the...
DescriptorName
0 Adenocarcinoma,Antineoplastic Agents,Colorecta...
1 Adenocarcinoma,Antineoplastic Agents,Colorecta...
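As a possible variation on the approach above, here is a minimal sketch that waits for the end of each PubmedArticle element instead of watching for the next PMID, and records missing fields as None rather than "". It assumes the element paths shown in the question's sample and the PubmedArticles wrapper used in this answer. The `iter_articles` helper, the shortened abstract text, and the second PMID (00000000) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield one dict per PubmedArticle; fields with no data become None."""
    for _, el in ET.iterparse(source, events=("end",)):
        if el.tag != "PubmedArticle":
            continue
        # Paths are relative to PubmedArticle and follow the sample in the question.
        abstract = [
            "".join(a.itertext())
            for a in el.findall("MedlineCitation/Article/Abstract/AbstractText")
        ]
        mesh = [
            d.text
            for d in el.findall("MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName")
            if d.text
        ]
        yield {
            "PMID": el.findtext("MedlineCitation/PMID"),
            "AbstractText": " ".join(abstract) or None,
            "Mesh": ",".join(mesh) or None,
        }
        el.clear()  # drop the finished article's subtree to limit memory use


# Hypothetical two-article sample: the second article has no abstract and no MeSH terms.
sample = io.BytesIO(b"""<PubmedArticles>
<PubmedArticle><MedlineCitation>
<PMID Version="1">29310420</PMID>
<Article><Abstract>
<AbstractText Label="RATIONALE">First part.</AbstractText>
<AbstractText Label="PATIENT CONCERNS">Second part.</AbstractText>
</Abstract></Article>
<MeshHeadingList>
<MeshHeading><DescriptorName UI="D000230">Adenocarcinoma</DescriptorName>
<QualifierName UI="Q000628">therapy</QualifierName></MeshHeading>
<MeshHeading><DescriptorName UI="D000970">Antineoplastic Agents</DescriptorName></MeshHeading>
</MeshHeadingList>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation>
<PMID Version="1">00000000</PMID>
</MedlineCitation></PubmedArticle>
</PubmedArticles>""")

rows = list(iter_articles(sample))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows, columns=["PMID", "AbstractText", "Mesh"])` should give the three-column frame, with the None values showing up as missing. Reading PMID by its path under MedlineCitation also avoids picking up any PMID elements nested deeper in an article, which a tag-name check alone would treat as the start of a new record. For the full 10 GB file, writing each row to the sqlite table as it is produced, as in the code above, is probably safer than building one large list.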