Trying to parse an XML file and put the text data into a dictionary with Python. KeyError: 0
I am using Python's ElementTree package to parse an XML file (below):
<?xml version="1.0" encoding="Cp1252"?>
<CATALOG>
<CD>
<COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
<COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
<COLUMN NAME='COUNTRY'>USA</COLUMN>
<COLUMN NAME='COMPANY'>Columbia</COLUMN>
<COLUMN NAME='PRICE'>10.90</COLUMN>
<COLUMN NAME='YEAR'>1985</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Hide your heart</COLUMN>
<COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
<COLUMN NAME='COUNTRY'>UK</COLUMN>
<COLUMN NAME='COMPANY'>CBS Records</COLUMN>
<COLUMN NAME='PRICE'>9.90</COLUMN>
<COLUMN NAME='YEAR'>1988</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
<COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
<COLUMN NAME='COUNTRY'>USA</COLUMN>
<COLUMN NAME='COMPANY'>RCA</COLUMN>
<COLUMN NAME='PRICE'>9.90</COLUMN>
<COLUMN NAME='YEAR'>1982</COLUMN>
</CD>
</CATALOG>
Running my code raises KeyError: 0:
$ python sample.py
Traceback (most recent call last):
  File "sample.py", line 30, in <module>
    k = tocsv[0].keys()
KeyError: 0
Is there a way to fix this and write the data to a CSV file without duplicates?
Using findall might simplify things a bit:
In [20]: x = """
...: <CATALOG>
...: <CD>
...: <COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
...: <COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
...: <COLUMN NAME='COUNTRY'>USA</COLUMN>
...: <COLUMN NAME='COMPANY'>Columbia</COLUMN>
...: <COLUMN NAME='PRICE'>10.90</COLUMN>
...: <COLUMN NAME='YEAR'>1985</COLUMN>
...: </CD>
...: <CD>
...: <COLUMN NAME='TITLE'>Hide your heart</COLUMN>
...: <COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
...: <COLUMN NAME='COUNTRY'>UK</COLUMN>
...: <COLUMN NAME='COMPANY'>CBS Records</COLUMN>
...: <COLUMN NAME='PRICE'>9.90</COLUMN>
...: <COLUMN NAME='YEAR'>1988</COLUMN>
...: </CD>
...: <CD>
...: <COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
...: <COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
...: <COLUMN NAME='COUNTRY'>USA</COLUMN>
...: <COLUMN NAME='COMPANY'>RCA</COLUMN>
...: <COLUMN NAME='PRICE'>9.90</COLUMN>
...: <COLUMN NAME='YEAR'>1982</COLUMN>
...: </CD>
...: </CATALOG>"""
In [21]: from xml.etree.ElementTree import fromstring
In [22]: xdata = fromstring(x)
In [23]: results = []
In [24]: for cd in xdata.findall('.//CD'):
    ...:     each_result = {}
    ...:     for each in cd.findall('.//COLUMN'):
    ...:         each_result[each.attrib.get('NAME')] = each.text
    ...:     results.append(each_result)
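First, I think you mean orglist[0].keys() rather than tocsv[0].keys(). tocsv is a single dictionary, so tocsv[0] looks up the key 0, which does not exist; orglist is the list of dictionaries, so orglist[0] is its first record. That change should fix your error.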
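To make the difference concrete, here is a minimal sketch. The two records are hypothetical stand-ins for the question's data, and tocsv is set up the way I assume the original script left it (pointing at the last dictionary):

```python
# Hypothetical stand-ins for the variables in the question.
orglist = [
    {"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan"},
    {"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler"},
]
tocsv = orglist[-1]  # a single dict, as after `tocsv = orgdata`

# Indexing a list with 0 returns its first element (a dict here).
header = list(orglist[0].keys())

# Indexing a dict with 0 looks up the *key* 0, which is not present.
try:
    tocsv[0].keys()
    error = None
except KeyError as exc:
    error = exc

print(header)
print(repr(error))
```

The list lookup gives you the column names; the dict lookup fails with the same KeyError: 0 shown in the traceback.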
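As for your second question:
Is there a way to fix this and write the data to a CSV file without duplicates?
Yes. You can use pandas.DataFrame to do this in three lines of code, as follows: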
>>> import pandas as pd
>>> df = pd.DataFrame(orglist)
>>> df.drop_duplicates(inplace=True)
>>> print(df)
Edit
So your code should look like this:
import xml.etree.ElementTree as ET
import pandas as pd

# Parse the file and collect one dict per <CD> element.
tree = ET.parse('sample.xml')
root = tree.getroot()
orglist = []
for child in root:
    orgdata = {}
    for sub in child:
        if sub.attrib.get('NAME') == 'TITLE':
            orgdata['TITLE'] = sub.text
        if sub.attrib.get('NAME') == 'ARTIST':
            orgdata['ARTIST'] = sub.text
        if sub.attrib.get('NAME') == 'COUNTRY':
            orgdata['COUNTRY'] = sub.text
        if sub.attrib.get('NAME') == 'COMPANY':
            orgdata['COMPANY'] = sub.text
        if sub.attrib.get('NAME') == 'PRICE':
            orgdata['PRICE'] = sub.text
        if sub.attrib.get('NAME') == 'YEAR':
            orgdata['YEAR'] = sub.text
    orglist.append(orgdata)

# Build a DataFrame and drop exact duplicate rows.
df = pd.DataFrame(orglist)
df.drop_duplicates(inplace=True)
print(df)
This prints (the column order may differ depending on the pandas version):
         ARTIST      COMPANY COUNTRY  PRICE             TITLE  YEAR
0     Bob Dylan     Columbia     USA  10.90  Empire Burlesque  1985
1  Bonnie Tyler  CBS Records      UK   9.90   Hide your heart  1988
2  Dolly Parton          RCA     USA   9.90     Greatest Hits  1982
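The question asks for a CSV file rather than printed output. With pandas, df.to_csv('out.csv', index=False) should write the frame with a single header row. If you would rather avoid pandas, the following is a rough standard-library sketch; the rows list, the repeated first record, and the unique_rows helper are all made up for illustration and are not from the original script:

```python
import csv
import io

# Sample rows; the repeated first row is an assumption added to show de-duplication.
rows = [
    {"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", "YEAR": "1985"},
    {"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler", "YEAR": "1988"},
    {"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", "YEAR": "1985"},
]


def unique_rows(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduped = unique_rows(rows)

# Write to an in-memory buffer; swap for open("out.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(deduped[0].keys()))
writer.writeheader()       # header is written exactly once
writer.writerows(deduped)  # then one line per unique record
csv_text = buffer.getvalue()
print(csv_text)
```

Calling writeheader() once, outside any loop, is what keeps the header from being repeated for every record.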
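Thanks, the solution works. For the duplicates I tried pandas and it works well (better than my other solution), but it prints the header along with the values every time. I tried another solution: "[link]()", but it still doesn't work. Any suggestions?
I edited my answer... hope this answers your question :)
This solution works, and I can remove the duplicates. However, I'm not supposed to use findall. Can I do it with find or some other function? Thank you very much.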
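On the last comment about avoiding findall: one possible approach, sketched here under my own assumptions, is to walk the CD elements with iter() and pick each column with find() plus an attribute predicate. The shortened XML string and the FIELDS list are illustrative only:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog, for illustration only.
xml_text = """<CATALOG>
<CD>
<COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
<COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Hide your heart</COLUMN>
<COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
</CD>
</CATALOG>"""

FIELDS = ["TITLE", "ARTIST"]  # hypothetical field list; extend as needed

root = ET.fromstring(xml_text)
records = []
for cd in root.iter("CD"):  # iterate over every <CD> element without findall
    record = {}
    for name in FIELDS:
        # find() returns the first matching child, or None if there is none.
        column = cd.find("COLUMN[@NAME='%s']" % name)
        if column is not None:
            record[name] = column.text
    records.append(record)

print(records)
```

Plain iteration over root and its children, as in the answer's edited code, also avoids findall; the predicate form just replaces the chain of if statements.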