Saving an offline copy of the Google Python tutorial using Python

Tags: python, python-2.7, beautifulsoup

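I am trying to write Python code that saves an offline copy of the "Google Python tutorial", so that I can read it even when I am not connected to the internet. For this I imported the following libraries: urllib, re, BeautifulSoup and os. The idea is to identify all URLs under the navigation path (class gc-toc), then loop over each URL and save the HTML file locally. The code is below.

My questions are:

The downloaded HTML files still try to load their CSS and JS files from the web. How can I download those files programmatically as well?

The whole program currently seems cumbersome. Can you suggest how to improve it? For example, I would like to avoid using re and BeautifulSoup to extract the links under the 'gc-toc' class.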

import urllib
import re
from BeautifulSoup import *
import os

# The URL from which the tags are to be scraped
url = 'https://developers.google.com/edu/python/'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# The scraped tags contain relative paths; they must be joined with the base URL before downloading
base_url = 'https://developers.google.com'
# Save path (raw string, so the backslash is never read as an escape sequence)
save_path = r'D:\My Local Directory'
urllist = list()

# Retrieve the nav element(s) that hold the table-of-contents links
tags = soup.findAll('nav',{'class':'gc-toc'})

for tag in re.findall('a href="(.+?)" title="',str(tags)):
    urllist.append(tag)

print 'The number of links extracted is', len(urllist)
print '----------Printing Urls---------------'
for url in urllist:
    full_url = urllib.basejoin(base_url, url)

    if 'youtube' in url: continue

    #Open the webpage and read html
    print 'Opening webpage file: ', full_url
    response = urllib.urlopen(full_url)
    response_html = response.read()

    # Save the html file offline
    print 'saving html file as ', url.split('/')[-1] +'.htm'

    # Binary mode keeps the downloaded bytes unchanged (no newline translation on Windows)
    with open(os.path.join(save_path, url.split('/')[-1] + '.htm'), 'wb') as output_file:
        output_file.write(response_html)
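
Regarding the second question (avoiding re and BeautifulSoup), one possible approach is the standard library's HTML parser. The following is only a minimal sketch, written for Python 3 rather than the Python 2.7 used in the question. The names TocLinkParser and extract_toc_links and the sample markup are made up for illustration; the real page structure may differ from what is assumed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TocLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <nav class="gc-toc">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.nav_depth = 0  # > 0 while the parser is inside the target <nav>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'nav':
            classes = (attrs.get('class') or '').split()
            # Count nested <nav> elements too, so the matching </nav> is found.
            if self.nav_depth or 'gc-toc' in classes:
                self.nav_depth += 1
        elif tag == 'a' and self.nav_depth and attrs.get('href'):
            # Relative links are resolved against the base URL.
            self.links.append(urljoin(self.base_url, attrs['href']))

    def handle_endtag(self, tag):
        if tag == 'nav' and self.nav_depth:
            self.nav_depth -= 1


def extract_toc_links(html, base_url):
    """Return absolute URLs of all links found inside the gc-toc navigation."""
    parser = TocLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup standing in for the downloaded page.
sample = '''
<nav class="gc-toc"><ul>
  <li><a href="/edu/python/set-up" title="Set up">Set up</a></li>
  <li><a href="/edu/python/strings" title="Strings">Strings</a></li>
</ul></nav>
<a href="/elsewhere">not in the table of contents</a>
'''
print(extract_toc_links(sample, 'https://developers.google.com'))
```

Tracking the nesting depth instead of matching text with a regular expression means the result does not depend on attribute order (the original pattern requires title to follow href directly).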

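You may not actually want to use Python for this. If all you need is the HTML of a page, you can use wget: running wget http://my.url fetches the page's HTML (if that is all you need). Alternatively, with the excellent requests API, you can do something similar: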

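import requests
url = 'https://developers.google.com/edu/python/'
with open('page', 'wb') as f:  # write raw bytes to avoid encoding errors
    f.write(requests.get(url).content)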

wget --no-parent --mirror -p --html-extension --convert-links -e robots=off -P . https://developers.google.com/edu/python/

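I know this does not help with your Python code, but it works. Doing this in Python is actually not trivial, because you have to fetch every URL a page refers to (including images and JavaScript) and rewrite the HTML so that it points to the downloaded files.

Comments:

Thanks. How did you do this on Windows?

@jgritty seems to be your man ;) Good luck.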
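
To give a rough idea of the two steps mentioned in the answer (collecting the asset URLs and rewriting the HTML to point at local copies), here is a minimal sketch in Python 3 using only the standard library. AssetParser, localize_assets, the assets directory and the example.com page are all hypothetical names chosen for illustration. The rewrite is a naive string replacement, so it would miss URLs that contain HTML entities, unquoted or single-quoted attributes, and assets referenced from inside CSS files; files with the same base name would also collide.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetParser(HTMLParser):
    """Record the raw URLs of stylesheets, scripts and images in a page."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get('rel') or '').lower().split()
        if tag == 'link' and 'stylesheet' in rel:
            ref = attrs.get('href')
        elif tag in ('script', 'img'):
            ref = attrs.get('src')
        else:
            ref = None
        if ref and ref not in self.assets:
            self.assets.append(ref)


def localize_assets(html, page_url, asset_dir='assets'):
    """Return (rewritten_html, {absolute_asset_url: local_path})."""
    parser = AssetParser()
    parser.feed(html)
    downloads = {}
    for ref in parser.assets:
        absolute = urljoin(page_url, ref)
        name = os.path.basename(urlparse(absolute).path) or 'index'
        local = asset_dir + '/' + name
        downloads[absolute] = local
        # Naive rewrite: swap the double-quoted attribute value for the local path.
        html = html.replace('"%s"' % ref, '"%s"' % local)
    return html, downloads


# Hypothetical page used only to demonstrate the rewrite.
page = '''<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
</head><body><img src="https://example.com/img/logo.png"></body></html>'''

new_html, downloads = localize_assets(page, 'https://example.com/edu/python/')
for remote, local in downloads.items():
    print(remote, '->', local)
```

The sketch deliberately performs no network access. Each entry in downloads could then be fetched, for example with urllib.request.urlretrieve(remote, local) after creating the target directory. For a complete mirror, the wget command above is likely the more robust choice.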