Is there a way in Python to parse the DOM tree of a website's content?

python selenium web-scraping phantomjs

There are packages for parsing a DOM tree from XML content, but I don't want to target XML, only the content of HTML web pages. I tried this:

from htmldom import htmldom
dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links on the page and print each "href" value
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )

But this gives me an error:

Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
    raise Exception
Exception

I have already looked at BeautifulSoup, but it is not what I want. BeautifulSoup only works on HTML pages; if the page content is loaded dynamically with JavaScript, it fails. I also don't want to locate elements with getElementByClassName and similar methods, but with something like dom.children(0).children(1).

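So is there a way, for example with a headless browser driven by Selenium, to parse the whole DOM tree structure so that I can reach the target element by walking through children and sub-children?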

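Yes, but it is not simple enough to include the code in an SO post. You are on the right track, though.

Basically, you need to use the headless renderer of your choice (e.g. Selenium) to download all the resources and execute the JavaScript. There is really no point in reinventing the wheel there.

Then you need to write the HTML from the headless renderer out to a file in the page-ready event (every headless browser I have used offers this). At that point you can use BeautifulSoup on that file to navigate the DOM. BeautifulSoup does support the child-based traversal you want.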
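The child-by-index navigation described above can be illustrated without any third-party dependency. The following is only a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup. Node, TreeBuilder, parse_html and the sample markup are hypothetical names made up for this illustration, and the HTML is assumed to have already been rendered and saved by the headless browser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tree (hypothetical helper, not a library class)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.child_nodes = []
        self.text = ""  # text found directly inside this element

    def children(self, index):
        """Return the index-th (0-based) element child."""
        return self.child_nodes[index]


class TreeBuilder(HTMLParser):
    """Builds a Node tree while html.parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].child_nodes.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for depth in range(len(self._stack) - 1, 0, -1):
            if self._stack[depth].tag == tag:
                del self._stack[depth:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data


def parse_html(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# Sample markup standing in for the HTML saved from the headless renderer.
page = ("<html><head><title>Demo</title></head>"
        "<body><div><p>one</p><a href='/x'>two</a></div></body></html>")
dom = parse_html(page)
# html -> body -> div -> a
link = dom.children(0).children(1).children(0).children(1)
print(link.tag, link.attrs["href"])  # prints: a /x
```

With BeautifulSoup itself, the analogous indexable list of children is a tag's .contents, which also includes text nodes, so indexes may be shifted by whitespace between tags.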

Selenium itself provides everything you might need. You can start from

html = driver.find_element_by_tag_name("html")

or from the body element, and then go on from there:

body.find_element_by_xpath('./*[' + str(x) + ']')  # the leading "." keeps the search relative to body

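This is the equivalent of body.children(x-1). Beyond that, you don't need BeautifulSoup or any other DOM traversal framework, but you can of course grab the page source and let another library such as BeautifulSoup parse it: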

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
soup.html.contents[0]  # .children is an iterator; .contents is the indexable list
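The XPath-based child lookup described above can be wrapped in a small helper so that a chain such as html.children(1).children(0) becomes a single call. This is a sketch under assumptions: child_xpath, descend and first_body_child_tag are hypothetical names, and the code relies on Selenium's find_element(by, value) method, passing the string "xpath" (the value behind By.XPATH). The last function needs Selenium plus a browser driver installed and is shown only as a usage example.

```python
def child_xpath(index):
    """Build an XPath for the index-th (0-based) element child of the current element."""
    if index < 0:
        raise ValueError("index must be non-negative")
    # XPath positions start at 1, and the leading "." keeps the path relative.
    return "./*[" + str(index + 1) + "]"


def descend(element, *indexes):
    """Follow 0-based child indexes from element.

    descend(html, 1, 0) corresponds to html.children(1).children(0).
    """
    for index in indexes:
        element = element.find_element("xpath", child_xpath(index))
    return element


def first_body_child_tag(url):
    """Usage example: tag name of the first element inside <body> after rendering."""
    from selenium import webdriver  # imported here so the helpers above work without it

    driver = webdriver.Firefox()  # assumption: any supported browser would do
    try:
        driver.get(url)
        html = driver.find_element("xpath", "/html")
        # Child 0 of <html> is <head>, child 1 is <body>.
        return descend(html, 1, 0).tag_name
    finally:
        driver.quit()
```

Each step is a separate round trip to the browser, so for deep or repeated traversals it may be cheaper to take driver.page_source once and walk the parsed tree locally.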