Scrapy 302 redirect

I am only crawling one page, but that page redirects. I added handle_httpstatus_list = [302, 301] and overrode the start_requests method. But now the problem is:

AttributeError: 'Response' object has no attribute 'xpath'

Spider code:
# -*- coding=utf-8 -*-
from __future__ import absolute_import
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, Spider
from car.items import Car58Item
import scrapy
import time

class Car51Spider(CrawlSpider):
    name = 'car51'
    allowed_domains = ['51auto.com']
    start_urls = ['http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time']
    rules = [Rule(LinkExtractor(allow=('/pabmdcigf?searchtype=searcarlist&curentPage=\d+\&isNewsCar\=0\&isSaleCar\=0\&isQa\=0\&orderValue\=record_time')), callback='parse_item', follow=True)]  # pagination rule
    handle_httpstatus_list = [302, 301]
    items = {}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_item)

    def parse_item(self, response):
        trs = response.xpath("//div[@class='view-grid-overflow']/a").extract()
        for tr in trs:
            sales_1 = u''
            item = Car58Item()
            urls = tr.xpath("a/@href").extract_first()
            item['url'] = tr.xpath("a/@href").extract_first()
            item['tip'] = tr.xpath("a/ul/li[@class='title']/text()").extract_first()
            item['name'] = tr.xpath("a/ul/li[@class='title']/text()").extract_first()
            sales_times = tr.xpath("a/ul/li[@class='info']/span/text()").extract()
            for x in sales_times:
                sales_1 = sales_1 + x
            item['sales_time'] = sales_1
            item['region'] = tr.xpath("a/ul/li[@class='info']/span[@class='font-color-red']/text()").extract_first()
            item['amt'] = tr.xpath("a/ul/li[@class='price']/div[1]/text()").extract_first()
            yield scrapy.Request(url=urls, callback=self.parse_netsted_item, meta={'item': item})

    def parse_netsted_item(self, response):
        dh = u''
        dha = u''
        mode = response.xpath("//body")
        item = Car58Item(response.meta['item'])
        dhs = mode.xpath("//div[@id='contact-tel1']/p/text()").extract()
        for x in dhs:
            dh = dh + x
        item['lianxiren_dh'] = dh
        lianxiren = mode.xpath("//div[@class='section-contact']/text()").extract()
        item['lianxiren'] = lianxiren[1]
        item['lianxiren_dz'] = lianxiren[2]
        item['details'] = mode.xpath("//div[@id='car-dangan']").extract()
        desc = mode.xpath("//div[@class='car-detail-container']/p/text()").extract()
        for d in desc:
            dha = dha + d
        item['description'] = dha
        item['image_urls'] = mode.xpath("//div[@class='car-pic']/img/@src").extract()
        item['collection_dt'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
        return item
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for car project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'car'
SPIDER_MODULES = ['car.spiders.car51']
#NEWSPIDER_MODULE = 'car.spiders.zhaoming'
DEFAULT_ITEM_CLASS = 'car.items.Car58Item'
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
'car.pipelines.MongoDBPipeline': 300,
'car.pipelines.Car58ImagesPipeline': 301
}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "car"
MONGODB_COLLECTION_CAR = "car"
MONGODB_COLLECTION_ZHAOMING = "zhaoming"
IMAGES_STORE = "img/"
DOWNLOAD_DELAY = 0.25  # 250 ms of delay
IMAGES_EXPIRES = 90
DOWNLOAD_TIMEOUT = 10
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_LEVEL = "DEBUG"
LOGSTATS_INTERVAL = 5
# LOG_FILE = '/tmp/scrapy.log'
CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
Scrapy log:
$scrapy crawl car51
2016-06-14 14:18:38 [scrapy] INFO: Scrapy 1.1.0 started (bot: car)
2016-06-14 14:18:38 [scrapy] INFO: Overridden settings: {'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'SPIDER_MODULES': ['car.spiders.car51'], 'BOT_NAME': 'car', 'DOWNLOAD_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 5, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:35.0) Gecko/20100101 Firefox/35.0', 'DEFAULT_ITEM_CLASS': 'car.items.Car58Item', 'DOWNLOAD_DELAY': 0.25}
2016-06-14 14:18:38 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-14 14:18:38 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-14 14:18:38 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-14 14:18:38 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/deprecate.py:156: ScrapyDeprecationWarning: `scrapy.contrib.pipeline.images.ImagesPipeline` class is deprecated, use `scrapy.pipelines.images.ImagesPipeline` instead
ScrapyDeprecationWarning)
2016-06-14 14:18:38 [py.warnings] WARNING: /Users/mayuping/PycharmProjects/car/car/pipelines.py:13: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2016-06-14 14:18:38 [scrapy] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline',
'car.pipelines.MongoDBPipeline',
'car.pipelines.Car58ImagesPipeline']
2016-06-14 14:18:38 [scrapy] INFO: Spider opened
2016-06-14 14:18:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-14 14:18:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-14 14:18:38 [scrapy] DEBUG: Crawled (302) <GET http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time> (referer: None)
**2016-06-14 14:18:39 [scrapy] ERROR: Spider error processing <GET http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time> (referer: None)**
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/mayuping/PycharmProjects/car/car/spiders/car51.py", line 22, in parse_item
trs = response.xpath("//div[@class='view-grid-overflow']/a").extract()
AttributeError: 'Response' object has no attribute 'xpath'
2016-06-14 14:18:39 [scrapy] INFO: Closing spider (finished)
2016-06-14 14:18:39 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 351,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 420,
'downloader/response_count': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 14, 6, 18, 39, 56461),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 2,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2016, 6, 14, 6, 18, 38, 437336)}
2016-06-14 14:18:39 [scrapy] INFO: Spider closed (finished)
When you add handle_httpstatus_list = [302, 301], you are telling Scrapy to call your callback even for HTTP redirect responses, instead of letting the framework handle redirects transparently for you (which is the default).

Some redirect HTTP responses have no body or no Content-Type header, so in those cases Scrapy passes the response to your callback as-is, i.e. as a plain Response object rather than an HtmlResponse, which is the subclass that provides the .xpath() and .css() shortcuts.

So either you really do need to handle HTTP 301 and 302 responses yourself, or you should write your callback so that it tests the status code (response.status) and only extracts data in the non-3xx case.