Python: download the entire content of an HTML page using Selenium

python html selenium selenium-webdriver

I need to download the entire content of an HTML page: images, CSS, and JS.

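First option:

1. Download the page with `urllib` or `requests`.
2. Extract the page information with `Beautiful Soup` or `lxml`.
3. Download all the linked resources.
4. Rewrite the links in the original page as relative links.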
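A minimal sketch of the link-extraction and link-rewriting steps of this option, assuming the page HTML has already been fetched (for example with `requests`). It uses the standard library's `html.parser` in place of Beautiful Soup or lxml so that it has no third-party dependencies. `AssetCollector`, `collect_assets`, `local_name`, and the `page_files` folder are hypothetical names chosen for illustration.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetCollector(HTMLParser):
    """Collect the URLs of resources referenced by <img>, <script> and <link>."""

    # tag name -> attribute that holds the resource URL
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        value = dict(attrs).get(wanted)
        if value:
            # Resolve relative references against the URL of the page itself.
            self.assets.append(urljoin(self.base_url, value))


def collect_assets(html, base_url):
    """Return the absolute URLs of the assets referenced by `html`."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets


def local_name(asset_url, folder="page_files"):
    """Map a remote asset URL to a relative path inside a local folder."""
    name = posixpath.basename(urlparse(asset_url).path) or "index"
    return posixpath.join(folder, name)
```

Each URL returned by `collect_assets` would then be downloaded to the path given by `local_name`, and the matching attribute in the saved page replaced with that relative path. The sketch does not handle name collisions, or resources referenced from inside CSS or from scripts.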

Drawbacks:

- It takes multiple steps.
- The downloaded page will never be identical to the remote page, probably because of JS or AJAX content.

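Second option:

Some authors suggest automating a web browser to download the page, so that the JavaScript and AJAX are executed before the download.

I want to use this option.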
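As a rough illustration of "letting the scripts run first", the sketch below polls `document.readyState` through the driver's `execute_script` method before anything is saved. `wait_until_loaded` is a hypothetical helper, and `driver` is assumed to be a Selenium WebDriver instance. A `complete` ready state does not guarantee that every AJAX request has finished, so this is a starting point rather than a reliable signal.

```python
import time


def wait_until_loaded(driver, timeout=10.0, poll=0.25):
    """Poll document.readyState until the browser reports 'complete'.

    `driver` is assumed to be a Selenium WebDriver (any object exposing
    execute_script works). Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

For content that is loaded later by AJAX, waiting for a specific element to appear is usually a better signal than the ready state alone.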

First attempt

So I copied this `selenium` code to perform two steps:

1. Open the URL in the Firefox browser.
2. Download the page.

Code

import os
from selenium import webdriver 
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', os.environ["HOME"])
profile.set_preference('browser.helperApps.alwaysAsk.force', False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/html,text/webviewhtml,text/x-server-parsed-html,text/plaintext,application/octet-stream')

browser = webdriver.Firefox(profile)

def open_new_tab(url):
    ActionChains(browser).send_keys(Keys.CONTROL, "t").perform()
    browser.get(url)
    return browser.current_window_handle

# call the function
open_new_tab("https://www.google.com")
# Result: the browser is opened at the given url, no download occurs

Result

Unfortunately, no download occurs; it only opens the browser at the provided URL (the first step).

Second attempt

I thought the download should be handled by a separate function, so I added this function.

Added function

def save_current_page():      
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

# call the function
open_new_tab("https://www.google.com")
save_current_page()

Result

# No more; the browser is opened at the given url, no download occurs.

Question

How can I download a web page automatically with Selenium?

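What exactly are you trying to do? Extract the page source? Selenium is doing what it is supposed to do: opening a web page for user interaction.

I want to download the web page through the browser's default downloader. I know I can download the page in two steps: 1/ extract the source with `selenium` or whatever, 2/ write the source to a local file. But I need to avoid this approach. I need to simulate a normal user and download the entire page content (images, CSS, JS) in one go.

Have you tried requests? It can fetch all the information on a web page.

The `requests` module takes many steps: 1/ download the page, 2/ extract all the links with `lxml` or `BeautifulSoup`, 3/ download the extracted links, 4/ rewrite the links in the original page as relative links... On top of that there is the problem of handling JavaScript and AJAX. You can try downloading [www.google.com] with the `requests` module. The downloaded page will be completely different from the real page, even if you manage to extract all the information.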
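For reference, the two-step approach mentioned in the comments (extract the source through Selenium, then write it to a local file) can be sketched as follows. `save_rendered_source` is a hypothetical helper, and `driver` is assumed to be a Selenium WebDriver instance. It saves only the HTML as the browser currently sees it, not the images, CSS, or JS files, so it does not by itself meet the "entire page" requirement.

```python
from pathlib import Path


def save_rendered_source(driver, url, out_path):
    """Load `url` in the browser and write the page source to `out_path`."""
    driver.get(url)
    # page_source is the source of the current page as reported by the browser;
    # this typically includes changes made by scripts that have already run.
    html = driver.page_source
    path = Path(out_path)
    path.write_text(html, encoding="utf-8")
    return path


# Usage with a real browser (requires Selenium and a Firefox driver):
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   save_rendered_source(browser, "https://www.google.com", "google.html")
#   browser.quit()
```

Keeping the driver as a parameter means the helper works with any browser Selenium supports and can be exercised with a stub object in tests.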