
Python - Saving a requests or BeautifulSoup object locally


I have some fairly long code, so it takes a long time to run. I just want to save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that next time I can save the time. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content, 'html.parser')

Since name.content is just HTML, you can dump it to a file and read it back in later.

Usually, the bottleneck is not the parsing, but the network latency of making the request.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so write in binary mode
with open("/tmp/A.html", "wb") as f:
    f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # do something with soup
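Putting the two halves together, here is a minimal caching sketch (the get_soup helper and CACHE_PATH are hypothetical names, not part of the answer above): it hits the network only when no local copy exists, and otherwise parses the cached file.

from bs4 import BeautifulSoup
import requests
import os

CACHE_PATH = "/tmp/A.html"  # hypothetical cache location

def get_soup(url):
    # fetch and cache the page only if we have no local copy yet
    if not os.path.exists(CACHE_PATH):
        response = requests.get(url)
        with open(CACHE_PATH, "wb") as f:
            f.write(response.content)
    # parse the cached copy
    with open(CACHE_PATH, "rb") as f:
        return BeautifulSoup(f, 'html.parser')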
Here is some anecdotal evidence that the bottleneck is the network:

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, 'html.parser')
t3 = time.perf_counter()

print(t2 - t1, t3 - t2)
Output, run on a ThinkPad X1 Carbon with a fast campus network:

0.11 0.02
Storing requests locally and restoring them as BeautifulSoup objects

If you are iterating through the pages of a website, you can use requests as described here to store every page.

Create a folder named soupCategory in the same folder as your script.
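If you would rather create the folder from the script itself, os.makedirs with exist_ok=True is one way to do it (a one-line sketch, not part of the original answer):

import os

os.makedirs("soupCategory", exist_ok=True)  # no error if the folder already exists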

Use whatever headers you like:

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import requests
import time

def getCategorySoup():
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525  # put your number of pages here
    for i in range(1, totalPages):
        url = basic_url + str(i)
        # use the session so the mounted retry adapter actually applies
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
            print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting", totalPages, "category pages is", round(total), "seconds")
    return
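Calling the function then populates the folder (assuming the soupCategory folder already exists and headers is defined as above):

getCategorySoup()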
Later on, you can create the BeautifulSoup object as @merlin2011 described:

with open("/soupCategory/1.html") as f:
  soup = BeautifulSoup(f)
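To restore every stored page rather than a single one, a small sketch using glob (the loop body is a placeholder):

from bs4 import BeautifulSoup
from glob import glob

for pageName in sorted(glob("./soupCategory/*.html")):
    with open(pageName, encoding='UTF-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
        # do something with each soup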

Comments:

You may find that module useful… How about saving the html source to an html file?

FYI, you can replace BeautifulSoup(f.read()) with just BeautifulSoup(f).

@alecxe, updated. Thanks.