
Python - Saving a requests or BeautifulSoup object locally


I have some fairly long code, so it takes a long time to run. I just want to save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that next time I can save the time. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content, 'html.parser')

Since name.content is just HTML, you can dump it to a file and read it back in later.

Usually, the bottleneck is not the parsing, but the network latency of making the request.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so write in binary mode
with open("/tmp/A.html", "wb") as f:
    f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # do something with soup
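Putting the two halves together, here is a minimal caching sketch (the get_soup helper and CACHE_PATH are hypothetical names, not part of the answer above): it hits the network only when no local copy exists, and otherwise parses the cached file.

from bs4 import BeautifulSoup
import requests
import os

CACHE_PATH = "/tmp/A.html"  # hypothetical cache location

def get_soup(url):
    # fetch and cache the page only if we have no local copy yet
    if not os.path.exists(CACHE_PATH):
        response = requests.get(url)
        with open(CACHE_PATH, "wb") as f:
            f.write(response.content)
    # parse the cached copy
    with open(CACHE_PATH, "rb") as f:
        return BeautifulSoup(f, 'html.parser')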
Here is some anecdotal evidence that the bottleneck is the network:

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, 'html.parser')
t3 = time.perf_counter()

print(t2 - t1, t3 - t2)
Output, run on a ThinkPad X1 Carbon with a fast campus network:

0.11 0.02
Storing requests locally and restoring them as BeautifulSoup objects

If you are iterating through the pages of a website, you can use requests as described here to store every page.

Create a folder named soupCategory in the same folder as your script.
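If you would rather create the folder from the script itself, os.makedirs with exist_ok=True is one way to do it (a one-line sketch, not part of the original answer):

import os

os.makedirs("soupCategory", exist_ok=True)  # no error if the folder already exists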

Use whatever headers you like:

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import requests
import time

def getCategorySoup():
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525  # put your number of pages here
    for i in range(1, totalPages):
        url = basic_url + str(i)
        # use the session so the mounted retry adapter actually applies
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
            print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting", totalPages, "category pages is", round(total), "seconds")
    return
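Calling the function then populates the folder (assuming the soupCategory folder already exists and headers is defined as above):

getCategorySoup()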
Later on, you can create the BeautifulSoup object as @merlin2011 described:

with open("/soupCategory/1.html") as f:
  soup = BeautifulSoup(f)
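To restore every stored page rather than a single one, a small sketch using glob (the loop body is a placeholder):

from bs4 import BeautifulSoup
from glob import glob

for pageName in sorted(glob("./soupCategory/*.html")):
    with open(pageName, encoding='UTF-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
        # do something with each soup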

Comments:

You may find that module useful… How about saving the html source to an html file?

FYI, you can replace BeautifulSoup(f.read()) with just BeautifulSoup(f).

@alecxe, updated. Thanks.