Python Crawler - html.fromstring


I am trying to parse a web page with this code:

import requests
from lxml import html

ac = requests.get('link....')
html_text = ac.text
lx = html.fromstring(html_text)
When I run this code, I get this error:

Traceback (most recent call last):
  File "Crawler.py", line 197, in <module>
    cnx.close()
  File "Crawler.py", line 46, in RequestPage
    lx = html.fromstring(html_text)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 867, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:76696)
  File "src\lxml\parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:115101)
  File "src\lxml\parser.pxi", line 1711, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:113677)
  File "src\lxml\parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:107847)
  File "src\lxml\parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:102150)
  File "src\lxml\parser.pxi", line 694, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:103800)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:102888)
lxml.etree.XMLSyntaxError: line 1843: Tag ie:menuitem invalid
I found the HTML tag that causes the error:

<ie:menuitem id="MSOMenu_Help" iconsrc="/_layouts/images/HelpIcon.gif" onmenuclick="MSOWebPartPage_SetNewWindowLocation(MenuWebPart.getAttribute('helpLink'), MenuWebPart.getAttribute('helpMode'))" text="Help" type="option" style="display:none">

</ie:menuitem>

You found the HTML tag that causes the error, but did you fix it? If not, try the following:

ac = requests.get('link....')
lx = html.fromstring(ac.content)
valueOfHTMLTag = lx.xpath('//TAG[@class/id="Name"]/text()')

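Where you change:

  • TAG to the tag whose value you want to get
  • class/id to either class or id, depending on which attribute you select the tag by
  • Name to the tag's id or class name
This returns a list containing the text of every matching tag that has the given class/id.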
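
To make the template above concrete, here is a minimal sketch, assuming Python 3 with lxml installed. The sample markup, the "title" class, and the "price" id are made up for illustration; substitute the tag and attribute names from the page you are actually crawling.

```python
from lxml import html

# Hypothetical sample page, standing in for the bytes in ac.content.
SAMPLE = b"""<html><body>
<div class="title">First</div>
<div class="title">Second</div>
<span id="price">42</span>
</body></html>"""

doc = html.fromstring(SAMPLE)

# Select by class: every <div> whose class attribute is exactly "title".
titles = doc.xpath('//div[@class="title"]/text()')

# Select by id: the <span> whose id attribute is "price".
price = doc.xpath('//span[@id="price"]/text()')

print(titles)  # ['First', 'Second']
print(price)   # ['42']
```

Note that xpath() always returns a list, so it is worth checking for an empty result before indexing into it.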


Hope this helps.

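You may need to define a custom element so that lxml can make sense of SharePoint's magic, or use the BeautifulSoup module as an alternative, which knows how to handle namespaced elements.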
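
As a rough illustration of a parser that tolerates such tags, here is a dependency-free sketch. It does not use BeautifulSoup itself. It uses Python 3's standard-library HTMLParser instead, which is what BeautifulSoup's "html.parser" backend builds on. The sample markup is a shortened, hypothetical version of the tag from the question, and the TagCollector class is my own name, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical sample modeled on the SharePoint markup in the question.
SAMPLE = (
    '<div id="menu">'
    '<ie:menuitem id="MSOMenu_Help" text="Help" type="option">'
    '</ie:menuitem>'
    '</div>'
)


class TagCollector(HTMLParser):
    """Records (tag, attributes) for every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()

# Keep only the prefixed tags such as ie:menuitem.
namespaced = [(tag, attrs) for tag, attrs in collector.seen if ':' in tag]
print(namespaced)
```

The prefixed tag comes through as an ordinary start tag named ie:menuitem, so a crawler built on this kind of parser can skip or inspect it instead of failing.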