Parsing XML in Python using ElementTree


I am very new to Python, and I need to parse some dirty XML files that have to be cleaned up first.

I have the following Python code:

import arff
import xml.etree.ElementTree
import re

totstring=""

with open('input.sgm', 'r') as inF:
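    for line in inF:
        # Drop every character that is not in the whitelist.
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line)
        # Append inside the loop so that every cleaned line is kept.
        totstring+=string


data=xml.etree.ElementTree.fromstring(totstring)

print(data)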
It parses this input:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;C T
&#22;&#22;&#1;f0704&#31;reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>&#2;
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
    In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
    Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
    Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
    New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
    Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
    April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
    Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
    Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
    Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
    Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
    Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
    Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

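How can I now get the text from inside the BODY tag?

All the tutorials I have seen rely on reading the XML directly from a file so that ElementTree.parse works. That does not work when I try to parse from a string, which breaks a lot of the tutorials I have read.


Thanks a lot.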

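If you do not care about the particular structure of the (possibly messy) XML document and just want to quickly get the content of a given tag/element, you may want to try the BeautifulSoup module.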


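Your first clue could be when you get a message like this:

>>> from xml.etree import ElementTree
>>> parse = ElementTree.parse('foo.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
    parser.feed(data)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: reference to invalid character number: line 11, column 0
>>>

This error comes from invalid characters in the XML source. You need to clean up the invalid characters (see fix_xml.py at the bottom of my answer).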
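
A minimal sketch of such a cleanup, not the fix_xml.py script itself, assuming the invalid characters appear as numeric character references such as &#5;. The names CHAR_REF, is_valid_xml_char and strip_invalid_char_refs are hypothetical, the sample string is shortened from the input above, and the code targets Python 3:

```python
import re

# Numeric character references, decimal (&#5;) or hexadecimal (&#x5;).
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def is_valid_xml_char(codepoint):
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def strip_invalid_char_refs(text):
    """Remove references to characters that XML 1.0 forbids; keep valid ones."""

    def replace(match):
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            codepoint = int(decimal)
        else:
            codepoint = int(hexadecimal, 16)
        # Keep the reference untouched when it is valid, drop it otherwise.
        return match.group(0) if is_valid_xml_char(codepoint) else ""

    return CHAR_REF.sub(replace, text)


# Hypothetical sample; &#65; is a valid reference and must survive.
dirty = "<UNKNOWN>&#5;&#5;&#5;C T &#22;f0704&#31;reute &#65;</UNKNOWN>"
clean = strip_invalid_char_refs(dirty)
print(clean)
```

Working on the whole string at once avoids the case where a reference is split across two chunks of a streamed read, at the cost of holding the file in memory.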

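Once you have clean XML, it is pretty easy. You should use StringIO to treat the string as a file:

>>> from xml.etree import ElementTree
>>> from StringIO import StringIO
>>> text = open('foo.xml', 'r').read()
>>> tree = ElementTree.parse(StringIO(text))
>>> tree.find('//BODY')
>>> tree.find('//BODY').text
'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at ...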
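
For readers on Python 3, where the StringIO class lives in the io module, the same idea might look like the sketch below. It assumes the XML is already clean; xml_text is a hypothetical, shortened stand-in for the document in the question, and the path is written as './/BODY' because recent ElementTree versions reject the absolute '//BODY' form.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the cleaned document.
xml_text = """<REUTERS NEWID="1">
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in
the Bahia cocoa zone.</BODY></TEXT>
</REUTERS>"""

# Option 1: parse the string directly; fromstring returns the root Element.
root = ET.fromstring(xml_text)
body_from_string = root.find(".//BODY").text

# Option 2: wrap the string in a file-like object so that ET.parse accepts it.
tree = ET.parse(io.StringIO(xml_text))
body_from_file_like = tree.find(".//BODY").text

print(body_from_string)
```

Both options yield the same text; fromstring is the shorter route when the cleaned XML is already held in a string.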
import BeautifulSoup
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(totstring)

body = soup.find("body")

bodytext = body.text
(py26_default)[mpenning@Bucksnort ~]$ python fix_xml.py foo.xml
bar.xml
343 &#5;
347 &#5;
351 &#5;
359 &#22;
364 &#22;
369 &#1;
378 &#31;
444 &#2;
3393 &#3;
(py26_default)[mpenning@Bucksnort ~]$
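# Assumption: ET is lxml.etree here, since recover=True is an lxml parser option.
from lxml import etree as ET
from lxml.etree import XMLParser
from lxml.html import soupparser
from StringIO import StringIO
# text is assumed to hold the XML document as a string.
try:
    # recover=True asks lxml to keep parsing past malformed input.
    parser = XMLParser(ns_clean=True, recover=True)
    tree = ET.parse(StringIO(text), parser)
except UnicodeDecodeError:
    # Fall back to the BeautifulSoup-based parser.
    tree = soupparser.parse(StringIO(text))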
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
    In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
    Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
    Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
    New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
    Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
    April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
    Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
    Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
    Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
    Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
    Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
    Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
 Reuter
</BODY></TEXT>
</REUTERS>
import xml.etree.ElementTree as ET
import sys
import re

class MyXMLParser(ET.XMLParser):

    # Matches decimal (&#5;) and hexadecimal (&#x5;) character references.
    rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")

    def _filter_ref(self, m):
        # Return the reference unchanged if it names a valid XML character,
        # otherwise return an empty string so that it is cut out.
        target = m.group(1)
        if target:
            num = int(target)
        else:
            num = int(m.group(2), 16)
        if (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
                or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
            return m.group()
        # is invalid xml character, cut it out of the stream
        print('removing %s' % m.group())
        return ''

    def feed(self, data):
        # Filter every reference in the chunk, then hand it to the real parser.
        mydata = self.rx.sub(self._filter_ref, data)
        super(MyXMLParser, self).feed(mydata)


parser = MyXMLParser(encoding='utf-8')
xml_filename = sys.argv[1]
xml_etree = ET.parse(xml_filename, parser=parser)