Using Python PyQt5 to scrape an IMDb web page
I have just started web scraping with Python, and I want to scrape an image from an IMDb page. That is the most important part. Here is the code I tried, since the page involves JavaScript:
import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

def main():
    page = Page('https://www.imdb.com/name/nm0005683/mediaviewer/rm2073384192')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    imagetag = soup.find('div', id='photo-container')
    print(imagetag)

if __name__ == '__main__': main()
This code was actually taken from another source; I only changed the link.

This is the output I get, including the error:
js: Uncaught TypeError: Cannot read property 'x' of undefined
Load finished
<div id="photo-container"></div>
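I don't know what the actual error is, and the content of the photo-container div is not showing up. I tried searching for the error on Google but could not find anything that helps in this situation. Also, if there is another approach I should try for scraping the image instead of this one, I am open to those suggestions as well.

PS: I am also new to Stack Overflow, so if anything here breaks the rules, I can edit the question as needed.

You will probably want to use the web channel (QWebChannel) for the actual work, but the following shows how to access the images you are looking for. I'll leave the web channel research to you.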
import sys
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, QTimer

class Page(QWebEnginePage):
    def __init__(self, parent):
        QWebEnginePage.__init__(self, parent)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)

    def _on_load_finished(self):
        print('Load finished')
        # loadFinished does not mean rendered, so we may need to wait here
        QTimer.singleShot(1000, self._after_loading)
        QTimer.singleShot(5000, self._exit)

    def _after_loading(self):
        print('_after_loading')
        js = '''console.log('javascript...');
        var images = document.querySelectorAll('#photo-container img');
        console.log('images ' + images);
        console.log('images ' + images.length);
        for (var i = 0; i < images.length; i++)
        {
            var image = images[i];
            console.log(image.src);
        }
        var element = document.querySelector('body');
        //console.log(element.innerHTML); // If you uncomment this you'll see the photo-container is still empty
        '''
        self.runJavaScript(js)
        print('_after_loading...done')

    def _exit(self):
        print('_exit')
        QApplication.instance().quit()

    # Forward messages from the page's JavaScript console to stdout
    def javaScriptConsoleMessage(self, level: QWebEnginePage.JavaScriptConsoleMessageLevel, message: str, lineNumber: int, sourceID: str):
        print(message)

def main():
    app = QApplication(sys.argv)
    w = QWebEngineView()
    w.setPage(Page(w))
    w.load(QUrl('https://www.imdb.com/name/nm0005683/mediaviewer/rm2073384192'))
    w.show()
    app.exec_()

if __name__ == '__main__': main()
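
The code above reports the image URLs through console.log. As a possible alternative, the script can be written as a single expression whose value is handed back to Python. The sketch below only builds that expression and tidies up the returned value; build_image_query and clean_sources are hypothetical helper names, not part of PyQt5, and the example URL is made up.

```python
import json


def build_image_query(selector):
    """Return a JavaScript expression that evaluates to a list of image URLs."""
    # json.dumps produces a correctly quoted and escaped JavaScript string literal.
    quoted = json.dumps(selector)
    return (
        "Array.from(document.querySelectorAll(%s))"
        ".map(function (img) { return img.src; })" % quoted
    )


def clean_sources(result):
    """Turn the value received from the page into a list of non-empty URLs."""
    if not result:
        # Covers None (script failed) and an empty list (nothing rendered yet).
        return []
    return [src for src in result if isinstance(src, str) and src]


if __name__ == "__main__":
    print(build_image_query("#photo-container img"))
    print(clean_sources(["https://example.com/a.jpg", "", None]))
```

Assuming the two-argument form of runJavaScript in PyQt5, _after_loading could call self.runJavaScript(build_image_query('#photo-container img'), callback), where callback is a Python function that receives the result and passes it through clean_sources. An empty list would suggest the page has not rendered the image yet, in which case the timer could simply be scheduled again.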
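
print(imagetag) is not showing the full content of the photo-container div because of the error. BeautifulSoup processes the raw HTML. The content rendered on a page is often filled in dynamically by JavaScript, and if you look at the page source you will see that this is the case here. To get the actual content, you need to do it inside the page, through JavaScript.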
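
To illustrate the difference between raw and rendered HTML, here is a minimal sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. Both HTML strings are made-up stand-ins rather than the real IMDb markup, and ContainerImages and find_images are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical page source: the container is empty and a script fills it later.
RAW_HTML = """
<html><body>
<div id="photo-container"></div>
<script>
  var img = document.createElement('img');
  img.src = 'https://example.com/photo.jpg';
  document.getElementById('photo-container').appendChild(img);
</script>
</body></html>
"""

# Hypothetical DOM snapshot after a browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="photo-container"><img src="https://example.com/photo.jpg"></div>
</body></html>
"""


class ContainerImages(HTMLParser):
    """Collect the src of every <img> inside the <div> with the given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # <div> nesting depth inside the container; 0 means outside
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "div":
                self.depth += 1
            elif tag == "img" and attrs.get("src"):
                self.sources.append(attrs["src"])
        elif tag == "div" and attrs.get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1


def find_images(html, container_id="photo-container"):
    parser = ContainerImages(container_id)
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    print(find_images(RAW_HTML))
    print(find_images(RENDERED_HTML))
```

A static parser never runs the script element, so it finds no image in the raw source. The image only shows up in the snapshot taken after the script has run, which is why the lookup has to happen inside the page or on HTML captured after rendering.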