Scrapy's pagination error
Hi guys, I'm getting the following pagination error while scraping a website:
2017-07-27 18:30:21 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Documents/Spiders/pedidosYa/pedidosYa/spiders/pedidosya.py", line 35, in parse
next_page_url = response.urljoin(next_page_url)
File "/usr/local/lib/python3.5/dist-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/usr/lib/python3.5/urllib/parse.py", line 416, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/usr/lib/python3.5/urllib/parse.py", line 112, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-27 18:30:21 [scrapy.extensions.feedexport] INFO: Stored csv feed (13 items) in: test3.csv
2017-07-27 18:30:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 653,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 62571,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 27, 23, 30, 21, 221038),
'item_scraped_count': 13,
'log_count/DEBUG': 16,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'memusage/max': 49278976,
'memusage/startup': 49278976,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2017, 7, 27, 23, 30, 17, 538310)}
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Spider closed (finished)
Thanks in advance, and have a nice day.
next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
                                                              ^^^^^^^^^^
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
                                     ^^^^^^^^^^^^^
Here you are calling urljoin on a list: when next_page_url is created, the extract() method returns a list of all matching values, even if there is only one. To fix this, use extract_first() instead:
next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first()
                                                               ^^^^^^^^^^^^^^^
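The traceback shows that response.urljoin ends up in the standard library's urllib.parse.urljoin, so the failure can probably be reproduced without Scrapy at all. The snippet below is only a sketch of that idea: the `?page=2` href is made up for illustration and is not taken from the real site.

```python
from urllib.parse import urljoin

# Base URL shortened from the one in the log above.
base = "https://www.pedidosja.com.br/restaurantes/sao-paulo"

# Hypothetical href list, shaped like what extract() returns (a list of str).
hrefs = ["/restaurantes/sao-paulo?page=2"]

# Passing the whole list mixes a str base with a non-str url.
try:
    urljoin(base, hrefs)
    type_error_raised = False
except TypeError:
    type_error_raised = True

# Passing a single string (what extract_first() returns) works.
next_page_url = urljoin(base, hrefs[0])
print(type_error_raised, next_page_url)
```

The same distinction applies inside the spider: urljoin needs one string, not the list.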
The problem is in this line:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract()
because extract() returns a list of results. Either use the extract_first() method, which gives only the first result:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract_first()
Or take the first element of the result list yourself:
next_page_url = response.css('li.arrow.next > a::attr(href)').extract()[0]
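One difference between the two options is worth keeping in mind: on the last page there is presumably no "next" link, so the result list is empty. Indexing with [0] then raises IndexError, while extract_first() returns None by default. The sketch below imitates that behaviour in plain Python; `first_or_default` is a hypothetical stand-in written for this example, not a Scrapy function, and the href is invented.

```python
def first_or_default(values, default=None):
    """Return the first item of values, or default when values is empty."""
    return values[0] if values else default

# Hypothetical extract() results for a middle page and for the last page.
hrefs_middle_page = ["/restaurantes/sao-paulo?page=2"]
hrefs_last_page = []

# Indexing an empty result list fails.
try:
    hrefs_last_page[0]
    index_error_raised = False
except IndexError:
    index_error_raised = True

# The extract_first()-style lookup falls back to None instead.
next_on_middle = first_or_default(hrefs_middle_page)
next_on_last = first_or_default(hrefs_last_page)
print(index_error_raised, next_on_middle, next_on_last)
```

With extract_first(), the existing `if next_page_url:` check is enough to stop cleanly on the last page.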