Python: extracting data from an lxml tree

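Unfortunately, it does not fully work, so I cannot extract the data I want from the lxml tree. I am not particularly interested in this specific case; I am looking for a more general answer.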

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  
  
  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

url = 'http://pycoders.com/archive/'
# This does the magic: loads everything.
r = Render(url)
# result is a QString.
result = r.frame.toHtml()
# The QString should be converted to a string before it is processed by lxml.
formatted_result = str(result.toAscii())

# Next, build the lxml tree from formatted_result.
tree = html.fromstring(formatted_result)

The guide then continues with:

archive_links = tree.xpath('//divass="campaign"]/a/@href')

This results in an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:59353)
  File "src\lxml\xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:171227)
  File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:170184)
lxml.etree.XPathEvalError: Invalid expression
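
As a minimal sketch, the failure can be reproduced without the Qt rendering step by running the same expression against a small hand-written document. The sample markup and the try_xpath helper below are assumptions made for illustration; they are not part of the guide.

```python
from lxml import etree, html

# Hypothetical markup; the real page is rendered by the Render class above.
SAMPLE = (
    '<html><body>'
    '<div class="campaign"><a href="/a.html">A</a></div>'
    '</body></html>'
)
tree = html.fromstring(SAMPLE)


def try_xpath(node, expression):
    """Return (results, None), or (None, message) if lxml rejects the expression."""
    try:
        return node.xpath(expression), None
    except etree.XPathError as exc:
        # XPathEvalError (seen in the traceback) is a subclass of XPathError.
        return None, "%s: %s" % (type(exc).__name__, exc)


# The expression from the guide is malformed: the opening "[" is missing.
bad = try_xpath(tree, '//divass="campaign"]/a/@href')
# A well-formed expression evaluates normally.
good = try_xpath(tree, '//div[@class="campaign"]/a/@href')

print(bad)
print(good)
```

Because the error is raised while the expression is being parsed, it does not depend on what the page contains.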

It is probably a typo. Try the following:

archive_links = tree.xpath('//div[class="campaign"]/a/@href')

Or:

archive_links = tree.xpath('//div[@class="campaign"]/a/@href')

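That syntax makes more sense, but unfortunately it returns archive_links = [] for me.

@MitchellvanZuylen, that is because you are only requesting the initial page source; to get the links you need to wait for the JavaScript to finish executing.

According to the guide, the Render class waits for the JS to execute. Am I misunderstanding the guide, is the guide wrong, or did you miss the Render class?

@MitchellvanZuylen, or maybe my assumption is wrong :)

Actually, the line should be

archive_links = tree.xpath('//div[@class="campaign"]/a/@href')

Note @class, not just class.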
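
The difference between class and @class can be checked on a small document. This is only a sketch: the markup below is a hypothetical stand-in for the rendered archive page, and the names without_at and with_at are made up for the example.

```python
from lxml import html

# Hypothetical markup standing in for the rendered archive page.
SAMPLE = """
<html><body>
  <div class="campaign"><a href="/archive/mail1.html">Issue 1</a></div>
  <div class="campaign"><a href="/archive/mail2.html">Issue 2</a></div>
  <div class="footer"><a href="/about.html">About</a></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Without "@", class names a child *element*: the predicate looks for a
# <class> child of <div> whose text is "campaign", so nothing matches.
without_at = tree.xpath('//div[class="campaign"]/a/@href')

# With "@", class is tested as an *attribute* of the <div>.
with_at = tree.xpath('//div[@class="campaign"]/a/@href')

print(without_at)  # []
print(with_at)     # ['/archive/mail1.html', '/archive/mail2.html']
```

If the @class version still returns an empty list on the real page, the rendered HTML probably does not contain the expected div elements, which is a separate problem from the XPath syntax.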
