How can I get around page-view limits when scraping web data with Python?
I'm using Python to pull US zip code population data from http://www.city-data.com via the following directory: . The specific pages I'm trying to scrape are individual zip code pages with URLs like: . All of the individual zip code pages I need to access have the same URL format, so my script simply does the following for each zip code in a range:

1. Build the URL for the given zip code
2. Try to get a response from that URL
3. If (2) succeeds, check the HTTP status code for that URL
4. If the status is 200, retrieve the HTML and scrape the data into a list
5. If the status is not 200, pass and count the error (not a valid zip code/URL)
6. If the request itself fails with an error, count the error and pass
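Step 1, building the URL, can be sketched like this; `zfill` handles the leading-zero problem the script's comments describe (a `range()` counter can't carry a leading zero, so the string has to be padded):

```python
# Zip codes below 10000 need a leading zero when embedded in the URL,
# so pad the numeric code out to five digits before formatting.
postal_code_string = str(1001).zfill(5)
url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
# url is 'http://www.city-data.com/zips/01001.html'
```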
##POSTAL CODE POPULATION SCRAPER##

import requests
import re
import datetime

def zip_population_scrape():
    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip', 'population']]  # list for storing scraped data

    # Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0

    for postal_code in range(1001, 5000):
        # Zero-pad to five digits; the postal code can't start with 0
        # inside range(), so pad the string representation instead.
        postal_code_string = str(postal_code).zfill(5)

        # all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        # try to get the current URL
        try:
            response = requests.get(url, timeout=5)
            http = response.status_code
            # print current status for logging purposes
            print url + " - HTTP: " + str(http)

            # if valid webpage:
            if http == 200:
                # save html as text
                html = response.text
                # extra print statement for status updates
                print "HTML ready"
                # try to find two substrings in the HTML text and add the
                # substring between them to the list with the postal code
                try:
                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
                    # add to scraped counter
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    # print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                # if the substrings are not found, fall back to the 2010
                # figures and do the same as above
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
            # if http == 404, the zip is not valid. Add to counter and print log
            elif http == 404:
                total_invalid += 1
                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
            # other http codes: add to error counter and print log
            else:
                errors += 1
                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
        # if the request fails with a connection error, add to error count & continue
        except requests.exceptions.ConnectionError:
            errors += 1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
        # if the request fails with a timeout, add to error count & continue
        except requests.exceptions.Timeout:
            errors += 1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."

    # print final log/counter data, along with timestamp finished
    now = datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."
Not sure why you need to scrape these. Have you looked at the US Census Bureau's data?
You could run the script from several proxies and sleep for a while between requests.
@vch The Census Bureau data isn't as easy to get at, and it isn't as accurate (that was my first thought). The city-data.com data is easier to access and more accurate.
@solarc I've never used a script with proxies before, and I've never written a script that sleeps between requests. Could you provide an answer with code? @TylerPalmer
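The proxy suggestion from the comments can be sketched as a simple rotation: keep a pool of proxy mappings and hand out the next one for each request. The addresses below are placeholders (assumptions), not real endpoints; substitute proxies you actually control.

```python
import itertools

# Hypothetical proxy pool -- replace these with proxies you actually control.
PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128'},
    {'http': 'http://10.10.1.11:3128'},
]

# Cycle through the pool so consecutive requests leave from different addresses.
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return the proxies dict for the next request, intended to be
    passed as requests.get(url, proxies=next_proxies(), timeout=5)."""
    return next(proxy_cycle)
```

Each page fetch in the loop would then come from a different source address, which spreads the per-client page-view count across the pool.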
import time; time.sleep(5)

will sleep for 5 seconds. You can add it to your loop (perhaps calling it once every 100 iterations) and use the time to try to let their server reset the page-view limit.
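The every-100-iterations throttle described above can be sketched as a small helper; the batch size and pause length are the values suggested in the answer, not anything mandated by the site:

```python
import time

REQUESTS_PER_PAUSE = 100   # assumed batch size from the answer above
PAUSE_SECONDS = 5          # assumed pause length from the answer above

def maybe_pause(requests_made):
    """Sleep after every full batch of requests; return True if it slept."""
    if requests_made > 0 and requests_made % REQUESTS_PER_PAUSE == 0:
        time.sleep(PAUSE_SECONDS)
        return True
    return False
```

Inside the scraper's for loop, calling `maybe_pause(total_scraped + total_invalid + errors)` after each request would pause the script once per hundred pages fetched.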