Python: How to keep Beautiful Soup from running slowly

python web-scraping proxy

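I'm trying to scrape about 360 URLs from a university library's website (they don't have an open API).

The code ran fine the first time, up until the last 5 URLs.

Then it raised an IndexError.

I added an exception handler for it and ran the code again. Now it runs extremely slowly, at about 1 minute per loop iteration.

Is the website throttling me? Is there a workaround?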

import sys
import time
from random import randint

import requests
from bs4 import BeautifulSoup
from requests.exceptions import ConnectionError
from urllib3.exceptions import MaxRetryError


def extract_page(df, page_list):
    # Build the headers once instead of on every iteration
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    })
    # Create counter
    counter = 0
    # Looping through all existing URLs
    for url in df["dc.identifier.uri"]:
        counter += 1

        try:
            # Using requests and BS to extract content
            r = requests.get(url, headers=headers)
            # Name the parser explicitly so the result does not depend on what is installed
            soup = BeautifulSoup(r.content, "html.parser")
            # Creating empty list to nest all URLs on the page
            url_list = []
            # Drilling into the extracted text to find all instances of "-PHD"
            for a in soup.find_all('a', href=True):
                if "-PHD" in a['href']:
                    abstract_url = "https://repository.nie.edu.sg" + a['href']
                    url_list.append(abstract_url)
            # Append first instance of this (raises IndexError if the page has no "-PHD" link)
            page_list.append(url_list[0])
            # Creating a time delay to be polite to the server
            time.sleep(0.3)
            # Print progress report and flush
            sys.stdout.write('\r' + "PROCESSING: " + str(counter) + "/" + str(df.shape[0]) + " >>> " + url + " >>> " + url_list[0])
            sys.stdout.flush()
            time.sleep(randint(1, 3))

        except (OSError, MaxRetryError, ConnectionError, IndexError):
            # counter was already incremented at the top of the loop
            page_list.append("Error Encountered.")
    # Creating a new column in the dataset
    df["abstract_page_url"] = page_list
    print("\r>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>COMPLETED ", counter, "/", df.shape[0])
    return df

The cause may be that some URLs take a long time to respond. You can set a maximum timeout for each request:

r = requests.get(url, headers=headers, timeout=30)
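A minimal sketch of how the timeout might be wired into the loop. When the limit is hit, requests raises requests.exceptions.Timeout, so a slow URL can be logged and skipped instead of stalling everything after it. The helper name fetch_with_timeout and its None-on-failure convention are assumptions for illustration, not part of the original code.

```python
import requests


def fetch_with_timeout(url, headers=None, timeout=30):
    """Return the response body, or None if the request fails or times out.

    Hypothetical helper: the name and the None-on-failure convention are
    assumptions made for this sketch.
    """
    try:
        # The timeout covers connecting and each wait for data from the
        # server; it is not a cap on the total download time.
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {url} ({e})")
        return None
    return r.content
```

In the question's loop, a None result could be handled the same way as the existing "Error Encountered." case.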

They're unlikely to throttle you for so few requests.
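One way to check rather than guess, sketched under assumptions: time a single request and look at how the server answered. A server that throttles often replies with HTTP 429 (Too Many Requests), sometimes with a Retry-After header, although a site can also simply respond slowly without signalling anything. The helper name probe and the layout of the returned dictionary are hypothetical.

```python
import time

import requests


def probe(url, headers=None, timeout=30):
    """Fetch url once and report how long it took and how the server answered."""
    start = time.perf_counter()
    r = requests.get(url, headers=headers, timeout=timeout)
    elapsed = time.perf_counter() - start
    return {
        "url": url,
        "status": r.status_code,
        "seconds": round(elapsed, 2),
        # Only present if the server chooses to send it.
        "retry_after": r.headers.get("Retry-After"),
        # 429 is the standard "Too Many Requests" status code.
        "throttled": r.status_code == 429,
    }
```

If the slow URLs come back with status 200 and no Retry-After header, slow pages are a more likely explanation than rate limiting.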

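Usually, when I run into slow requests, I like to use multithreading to speed things up. That way, the slower requests don't hold up the others. (It may be overkill for scraping only 360 pages, but if they really take a minute each, it will save a lot of time.)

This requires creating two functions: one to submit the request and another to parse the HTML with Beautiful Soup. I'm sure there are other ways to do this, but it has worked well for me. Good luck!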

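from concurrent.futures import ThreadPoolExecutor

# page_to_scrape_request and parse_response are the two functions described
# above; list_of_urls is the list of URLs to fetch.
with ThreadPoolExecutor(max_workers=25) as ex:
    # ex.map runs the requests concurrently and yields results in input order
    responses = ex.map(page_to_scrape_request, list_of_urls)

    for response in responses:
        parse_response(response)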
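The two functions are not shown in the answer. One possible shape for them is sketched below, reusing the "-PHD" link filter and the User-Agent from the question. The function bodies, the (url, html) tuple passed between them, the scrape_all wrapper, and the smaller max_workers=8 default are all assumptions, not the answerer's actual code. urljoin is used in place of the hard-coded host so that relative links resolve against the page they came from.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}


def page_to_scrape_request(url):
    """Fetch one page; return (url, html bytes), or (url, None) on failure."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()
        return url, r.content
    except requests.exceptions.RequestException:
        return url, None


def parse_response(response):
    """Return the first link containing "-PHD", or None if there is none."""
    url, html = response
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if "-PHD" in a["href"]:
            # Resolve relative hrefs against the page URL.
            return urljoin(url, a["href"])
    return None


def scrape_all(list_of_urls, max_workers=8):
    """Fetch all URLs concurrently and parse each page as its result arrives."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # ex.map yields results in the same order as list_of_urls,
        # so the output lines up with the input.
        responses = ex.map(page_to_scrape_request, list_of_urls)
        return [parse_response(response) for response in responses]
```

Because the results keep the input order, something like df["abstract_page_url"] = scrape_all(list(df["dc.identifier.uri"])) would line up with the DataFrame rows. Pages that fail or have no matching link yield None rather than raising IndexError.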