Scraping data with Python from a web page that uses JavaScript

python python-3.x web-scraping


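I am trying to scrape the title from a web page. At first I tried BeautifulSoup, but found that the page itself does not load without JavaScript. So I am now using some code I found through Google that uses the requests-html library: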

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

But I always get this error:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1


Does anyone know what this means? I am new to this, so I apologize if I am using any terms incorrectly.

This looks like a bug in the underlying library, pyppeteer, triggered while it processes some of the page's JavaScript. Here is a workaround that may help:

resp.html.render(sleep=1, keep_page=True)

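You need to run the JavaScript, because without it the HTML content is not loaded. You can use Selenium for that; give it a try.

Selenium is a library that lets a program interact with web pages by controlling a browser.

Here is an example: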
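A minimal sketch of the Selenium approach, assuming the selenium package and a local Chrome installation are available. The helper names `make_headless_chrome`, `scrape_title_and_headings` and `main` are made up for illustration and are not part of Selenium. Note that `driver.get` only waits for the page's load event, so content that JavaScript inserts later may need an explicit wait.

```python
URL = "https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601"


def make_headless_chrome():
    """Create a headless Chrome driver (assumes selenium and Chrome are installed)."""
    # Imported lazily so the rest of the module can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def scrape_title_and_headings(driver, url):
    """Load url in the given driver and return (page title, list of <h1> texts)."""
    driver.get(url)  # the browser runs the page's JavaScript
    title = driver.title.strip()
    # "tag name" is the locator strategy string behind By.TAG_NAME
    headings = [el.text.strip() for el in driver.find_elements("tag name", "h1")]
    return title, headings


def main():
    """Hypothetical entry point: scrape the page from the question."""
    driver = make_headless_chrome()
    try:
        title, headings = scrape_title_and_headings(driver, URL)
        print(title)
        print(headings)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Call `main()` to run it against the real page. If the headings come back empty, waiting for the element with `WebDriverWait` before reading it is the usual next step.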

For anyone else with this problem: as Ivan said, sleep=1, keep_page=True does the trick. Here is the complete code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Output:

[<title>
    Milled wheat and wheat flour produced</title>]


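Hmm, I wish that were what I got, but I still seem to get the same error.

Did you try it with my code? I ran it in my console (Python 3.7) and it works like a charm. Please paste your code now so we can fix it :)

So... I did try your code... sometimes it works and sometimes it doesn't. Honestly, I don't know why.

I tried to reproduce it: I ran it 10 times in a row and it succeeded every time. If your internet connection is slow, with load times up to 5 seconds, try sleep=2 (2 seconds). From the docs: sleep is an integer, if provided, giving how long to sleep after the initial render.

I tried it, but I still seem to get a similar error.

You can try increasing the sleep parameter. That helps if the page is heavy and your machine is slow.

Hmm, I am trying to follow this tutorial, but I am not sure how it works.

The problem is specific to the page you are trying to scrape, because it has protection against scrapers.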
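Since the error shows up only some of the time and a longer sleep is the suggested fix, one option is to retry the render with a growing sleep value. The sketch below is a rough illustration, not part of requests-html: `render_with_retry` and its parameters are hypothetical names, and it assumes that calling `render` again after a failure is safe.

```python
import time


def render_with_retry(render, attempts=3, base_sleep=1,
                      exceptions=(Exception,), pause=time.sleep):
    """Call render(sleep=..., keep_page=True), doubling sleep after each failure.

    render is expected to be something like resp.html.render; this helper is
    a hypothetical wrapper, not a requests-html function.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        sleep = base_sleep * (2 ** attempt)  # 1, 2, 4, ... seconds
        try:
            return render(sleep=sleep, keep_page=True)
        except exceptions as error:  # e.g. pyppeteer.errors.NetworkError
            last_error = error
            pause(1)  # short pause before the next attempt
    raise last_error


# Usage with requests-html (left as comments so the block runs without it):
# from requests_html import HTMLSession
# session = HTMLSession()
# resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
# render_with_retry(resp.html.render)
# print(resp.html.find("title", first=True).text)
```

Passing a narrower tuple for `exceptions` avoids hiding unrelated bugs. If the site is actively blocking scrapers, as the last comment suggests, retrying will not help.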