How can I get around page-view limits when scraping web data with Python?
I'm using Python to pull US zip code population data from http://www.city-data.com via the following directory: . The specific pages I'm trying to scrape are individual zip code pages with URLs like: . All of the individual zip code pages I need to access have the same URL format, so my script simply does the following for each zip code in a range:

1. Build the URL for the given zip code
2. Try to get a response from that URL
3. If (2) succeeds, check the HTTP status code for that URL
4. If the status is 200, retrieve the HTML and scrape the data into a list
5. If the status is not 200, pass and count the error (not a valid zip code/URL)
6. If the request itself fails with an error, count the error and pass
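Step 1, building the URL, can be sketched like this; `zfill` handles the leading-zero problem the script's comments describe (a `range()` counter can't carry a leading zero, so the string has to be padded):

```python
# Zip codes below 10000 need a leading zero when embedded in the URL,
# so pad the numeric code out to five digits before formatting.
postal_code_string = str(1001).zfill(5)
url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
# url is 'http://www.city-data.com/zips/01001.html'
```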
##POSTAL CODE POPULATION SCRAPER##

import requests
import re
import datetime

def zip_population_scrape():
    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip', 'population']]  # list for storing scraped data

    # Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0

    for postal_code in range(1001, 5000):
        # Zero-pad to five digits; the postal code can't start with 0
        # inside range(), so pad the string representation instead.
        postal_code_string = str(postal_code).zfill(5)

        # all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        # try to get the current URL
        try:
            response = requests.get(url, timeout=5)
            http = response.status_code
            # print current status for logging purposes
            print url + " - HTTP: " + str(http)

            # if valid webpage:
            if http == 200:
                # save html as text
                html = response.text
                # extra print statement for status updates
                print "HTML ready"
                # try to find two substrings in the HTML text and add the
                # substring between them to the list with the postal code
                try:
                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
                    # add to scraped counter
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    # print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                # if the substrings are not found, fall back to the 2010
                # figures and do the same as above
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
            # if http == 404, the zip is not valid. Add to counter and print log
            elif http == 404:
                total_invalid += 1
                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
            # other http codes: add to error counter and print log
            else:
                errors += 1
                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
        # if the request fails with a connection error, add to error count & continue
        except requests.exceptions.ConnectionError:
            errors += 1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
        # if the request fails with a timeout, add to error count & continue
        except requests.exceptions.Timeout:
            errors += 1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."

    # print final log/counter data, along with timestamp finished
    now = datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."
Not sure why you need to scrape these. Have you looked at the US Census Bureau's data?
You could run the script from several proxies and sleep for a while between requests.
@vch The Census Bureau data isn't as easy to get at, and it isn't as accurate (that was my first thought). The city-data.com data is easier to access and more accurate.
@solarc I've never used a script with proxies before, and I've never written a script that sleeps between requests. Could you provide an answer with code? @TylerPalmer
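The proxy suggestion from the comments can be sketched as a simple rotation: keep a pool of proxy mappings and hand out the next one for each request. The addresses below are placeholders (assumptions), not real endpoints; substitute proxies you actually control.

```python
import itertools

# Hypothetical proxy pool -- replace these with proxies you actually control.
PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128'},
    {'http': 'http://10.10.1.11:3128'},
]

# Cycle through the pool so consecutive requests leave from different addresses.
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return the proxies dict for the next request, intended to be
    passed as requests.get(url, proxies=next_proxies(), timeout=5)."""
    return next(proxy_cycle)
```

Each page fetch in the loop would then come from a different source address, which spreads the per-client page-view count across the pool.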
import time; time.sleep(5)

will sleep for 5 seconds. You can add it to your loop (perhaps calling it once every 100 iterations) and use the time to try to let their server reset the page-view limit.
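The every-100-iterations throttle described above can be sketched as a small helper; the batch size and pause length are the values suggested in the answer, not anything mandated by the site:

```python
import time

REQUESTS_PER_PAUSE = 100   # assumed batch size from the answer above
PAUSE_SECONDS = 5          # assumed pause length from the answer above

def maybe_pause(requests_made):
    """Sleep after every full batch of requests; return True if it slept."""
    if requests_made > 0 and requests_made % REQUESTS_PER_PAUSE == 0:
        time.sleep(PAUSE_SECONDS)
        return True
    return False
```

Inside the scraper's for loop, calling `maybe_pause(total_scraped + total_invalid + errors)` after each request would pause the script once per hundred pages fetched.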