XML parsing error: not well-formed (invalid token) with Python

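Hi, I am scraping XML files. For HTML I used scrapy, and for XML I decided to parse the files with xml.sax.

Below is some sample code (don't treat it as a real example), just to illustrate my question: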

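from xml.sax.handler import ContentHandler
import xml.sax

xmlFilePath = 'users/documents/jobstext.xml'

try:
    parser = xml.sax.make_parser()
    # Open in binary mode so the parser applies the encoding from the XML declaration
    with open(xmlFilePath, 'rb') as xmlFile:
        parser.parse(xmlFile)

except xml.sax.SAXParseException as e:
    # str(e) has the form "<file>:<line>:<column>: <message>"
    print("*** PARSER error: %s" % e)
    print(e, "What is the error actually >>>>")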
Here is the XML:

<?xml version="1.0" encoding="utf-8"?>
<jobs>
  <reader><![CDATA[Identity Group]]></reader>
  <readerUrl><![CDATA[http://www.example.com]]></readerUrl>

  <job>
    <title><![CDATA[Architect - OT]]></title>
    <category><![CDATA[LTC/SNF]]></category>
    <jobId><![CDATA[139693]]></jobId>
    <specialization><![CDATA[LTC/SNF]]></specialization>
    <positionType><![CDATA[Travel]]></positionType>
    <description><![CDATA[<DIV>OT&nbsp;needed for a SNF in&nbsp;Oregon.&nbsp; Oregon is a dramatic land of many changes. From the rugged Oregon seacoast, the high mountain passes of the country for Travel Allied Professionals and Travel Nurses. Our clients are among the most prestigious healthcare facilities in the country.</DIV>
<DIV>&nbsp;</DIV>
 </description>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"> <SPAN style="mso-spacerun: yes">&nbsp;</SPAN>55 FTEs <o:p></o:p></FONT></SPAN></FONT></P>
  </job>
</jobs>

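Your <description> element has no end tag, and the CDATA section inside it is never terminated... although I would expect that to produce an error at the end of the document, not on the third line of data in that element.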
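The point about the unterminated CDATA section can be checked in isolation. The following is only a minimal sketch: the helper name `cdata_parse_error` and the two tiny sample documents are made up for illustration and are not taken from the question's file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def cdata_parse_error(xml_bytes):
    """Return the parser's error message, or None if the input is well-formed."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()
    return None


# CDATA section closed with "]]>": the markup inside it is treated as plain text.
terminated = b"<job><description><![CDATA[<DIV>text</DIV>]]></description></job>"

# Same document without "]]>": the section swallows everything up to end of input.
unterminated = b"<job><description><![CDATA[<DIV>text</DIV></description></job>"

print(cdata_parse_error(terminated))
print(cdata_parse_error(unterminated))
```

The first call should report no error and the second should report one, because the parser only finds out at the end of the input that the section was never closed.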

Since the question has changed:

XML attribute values must be quoted.

For example:
class=MsoNormal
should be
class="MsoNormal"
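To see the quoting rule on its own, here is a small sketch that feeds a quoted and an unquoted variant of the same tag to xml.sax. The helper name `first_parse_error` and the one-line test inputs are assumptions made for this illustration; the real file will report a different line and column.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def first_parse_error(xml_bytes):
    """Parse xml_bytes with xml.sax; return None on success, else the error text."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        # str(exc) includes the location followed by the parser's message
        return str(exc)
    return None


quoted = b'<P class="MsoNormal"></P>'
unquoted = b'<P class=MsoNormal></P>'

print(first_parse_error(quoted))    # no error expected for the quoted attribute
print(first_parse_error(unquoted))  # expected to report a "not well-formed" error
```

Only the unquoted variant should fail, which matches the kind of error the question reports for the P tags.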

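Thanks for the reply. Yes, I actually updated the end tag for description. But in my real XML the CDATA has no end tag. The error is at the P tag, though. Please tell me what is wrong in the P tag and how to avoid the error.

@shivakrishna - The CDATA section still isn't terminated.

Never mind that my XML has no end marker for any CDATA; everything else works fine. Please focus on the para tag: everything in the XML is fine, but for the para tag I pasted above it shows an error at column 150. So how can I ignore this kind of error?

Actually, the point is that my XML file has a lot of job-related data, so I pasted only the XML for one job here. Some of the tags I edited above without commas behave the same way.

OK, I have now pasted only the paragraph tags, copied directly from the file without editing. Can you tell me what is wrong now?

@shivakrishna - If you reduce the XML to [...], you will still get that error. Attribute values must be quoted. You may have other errors too, but this is the first one.

Oh, thanks Quentin. Can we remove the "?" from the p tag during execution? (I expect it to run without errors, since there are no other errors beyond the p tag.)

@Quentin: the two tags above (for example) run without errors when written, as you suggested, with class="MsoNormal".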
*** PARSER error: users/documents/jobstext.xml:13:150: not well-formed (invalid token)
users/documents/jobstext.xml:13:150: not well-formed (invalid token) What is the error actually >>>>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">THE MOST COMPETITIVE RATES IN NM .....<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">Busy <?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:place w:st="on"><st1:PlaceName w:st="on">Acute</st1:PlaceName> <st1:PlaceName w:st="on">Care</st1:PlaceName> <st1:PlaceType w:st="on">Hospital</st1:PlaceType></st1:place> needs Occupational Therapists.&nbsp; Experience with </SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Ortho, Neuro, vestibular balance, aquatic a plus!<SPAN style="COLOR: black">&nbsp; New grads welcome.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Signon Bonus and help with relocation.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>For more details please call or email Carole 800 995 2673 X1329 or <A href="mailto:cs@coremedicalgroup.com"><SPAN style="mso-bidi-font-weight: bold; mso-bidi-font-size: 12.0pt">cs@coremedicalgroup.com</SPAN></A><o:p></o:p></SPAN></SPAN></P>