
Python and lxml.html: output problem when getting an element by id


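I'm currently trying to pull data out of an HTML file. The code I'm using appears to work, but not the way I expected. I can get some items but not all of them, and I'm wondering whether that has to do with the size of the file I'm trying to read.

I'm currently trying to parse the source of the page fetched in the code below.

That page is about 4,500 lines long, so it's a decent size. I've been using it because I want to make sure the code works on large files.

The code I'm using is: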

import lxml.html
import lxml
import urllib2

webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()

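When I use 'mm3' or some other element near the top of the page, this gives the expected output, but when I use the 'productDetails' element I get no output at all. At least that's the case with my current setup.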

I'm afraid lxml.html cannot parse this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in its default recover mode).
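To check for this symptom on your own input, a minimal sketch along these lines may help. It runs on a small made-up HTML string rather than the real page, it is written for Python 3, and the helper name describe_element is hypothetical. It reports how many children and how much text lxml.html attached to the element with a given id.

```python
import lxml.html


def describe_element(html_text, element_id):
    """Report what lxml.html attached to the element with the given id."""
    doc = lxml.html.fromstring(html_text)
    # Passing a default avoids the KeyError raised for a missing id.
    element = doc.get_element_by_id(element_id, None)
    if element is None:
        return None
    return {
        "tag": element.tag,
        "children": len(element),        # direct child elements only
        "text": element.text_content(),  # text of the element and its descendants
    }


# Made-up sample: the h3 is empty and the description sits next to it.
sample = '<div><h3 id="productDetails"></h3><p>Description text</p></div>'
info = describe_element(sample, "productDetails")
print(info)
```

Iterating over an element, as in `for element in productDetails`, walks its child elements only, so an element with zero children produces no loop iterations and therefore no output.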

Parsing the same page with BeautifulSoup and the html5lib parser (the code is further down) prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...

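Thank you very much for the help! I'll go ahead and try the other answer. I didn't realize that empty elements were the default recover mode. I wish I had read a bit deeper and known that before spending hours trying to work it out myself.

Sure, thanks. FYI, I mentioned recover mode only to point out that lxml.html uses it by default, and there is no easy way to tell it to be more lenient.

I completely understand. I just hadn't seen that in the documentation. This is a huge help, because I run into this empty element a lot and couldn't make sense of it.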
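For readers who want to see what the parser had to recover from, here is a rough sketch, assuming Python 3 and a made-up HTML string. The function name parse_with_log is hypothetical. It passes recover=True explicitly (which is already the default for lxml.etree.HTMLParser) and then reads the parser's error_log. Which entries show up, if any, depends on the installed lxml and libxml2 versions, so no particular messages are assumed here.

```python
from lxml import etree


def parse_with_log(html_text):
    """Parse HTML in recover mode and return the root plus any logged problems."""
    parser = etree.HTMLParser(recover=True)  # recover=True is the default
    root = etree.fromstring(html_text, parser)
    # error_log holds the errors and warnings of the last parser run.
    problems = [(entry.line, entry.message) for entry in parser.error_log]
    return root, problems


# Made-up sample with an unclosed <p>.
root, problems = parse_with_log('<div><h3 id="x"></h3><p>text</div>')
print(root.tag)
for line, message in problems:
    print(line, message)
```

An empty log does not prove the markup was valid; it only means this parser build reported nothing.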

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

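The BeautifulSoup snippet above is written for Python 2 (urllib2 and the print statement). A Python 3 sketch of the same idea might look like the following. The function name element_texts and the sample HTML are made up for illustration. In Python 3 the page itself would be fetched with urllib.request.urlopen(url).read() instead of urllib2. The demo call uses the built-in html.parser so that it runs without html5lib installed; leave the parser argument at its default to get the more lenient html5lib behaviour used above.

```python
from bs4 import BeautifulSoup


def element_texts(html_text, element_id, parser="html5lib"):
    """Return the text of every tag nested under the element with the given id."""
    soup = BeautifulSoup(html_text, parser)
    container = soup.find(id=element_id)
    if container is None:
        return []
    # find_all() with no arguments yields every descendant tag.
    return [tag.text for tag in container.find_all()]


# Made-up sample standing in for the downloaded page.
sample = '<div id="productDetails"><p>Description text</p><b>outrunner</b></div>'
for text in element_texts(sample, "productDetails", parser="html.parser"):
    print(text)
```

Because find_all() returns descendants at every depth, text inside nested tags appears once for each enclosing tag.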