Python: Pagination not applied with Scrapy CrawlSpider and LinkExtractor rules
I can't understand why the CrawlSpider in Scrapy does not paginate even though the rules are set. However, if I change start_url to another URL and comment out parse_start_url, I get more items for that page. My goal is to scrape all categories. Any idea what I am doing wrong?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bitcointravel.items import BitcointravelItem


class BitcoinSpider(CrawlSpider):
    name = "bitcoin"
    allowed_domains = ["bitcoin.travel"]
    start_urls = [
        "http://bitcoin.travel/categories/"
    ]

    rules = (
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('.+/page/\d+/$'), restrict_xpaths=('//a[@class="next page-numbers"]'),),
             callback='parse_items', follow=True),
    )

    def parse_start_url(self, response):
        for sel in response.xpath("//ul[@class='maincat-list']/li"):
            url = sel.xpath('a/@href').extract()[0]
            if url == 'http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/':
                # url = 'http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/'
                yield scrapy.Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for sel in response.xpath("//div[@class='grido']"):
            item = BitcointravelItem()
            item['name'] = sel.xpath('a/@title').extract()
            item['website'] = sel.xpath('a/@href').extract()
            yield item
This is the result:
{'downloader/request_bytes': 574,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 98877,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 15, 13, 44, 17, 37859),
'item_scraped_count': 24,
'log_count/DEBUG': 28,
'log_count/INFO': 8,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 2, 15, 13, 44, 11, 250892)}
2016-02-15 14:44:17 [scrapy] INFO: Spider closed (finished)
The item count is supposed to be 55 rather than 24, since the HTML source contains links matching the pattern in the rule, '.+/page/\d+/$':
<a class='page-numbers' href='http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/2/'>2</a>
<a class='page-numbers' href='http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/3/'>3</a>
Whereas the start page does not contain similar links; it mainly contains links to other category pages:
...
<li class="cat-item cat-item-227"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-coffee-tea-supplies/" title="The best Coffee & Tea Supplies businesses where you can spend your bitcoins!">Coffee & Tea Supplies</a> </li>
<li class="cat-item cat-item-50"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-cupcakes/" title="The best Cupcakes businesses where you can spend your bitcoins!">Cupcakes</a> </li>
<li class="cat-item cat-item-229"><a href="http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-distilleries/" title="The best Distilleries businesses where you can spend your bitcoins!">Distilleries</a> </li>
...
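To see which of the links quoted above the rule's allow pattern would accept, the check can be reproduced with the standard re module. This is only a sketch: it assumes LinkExtractor applies allow as a regex search against the absolute URL, and it ignores the restrict_xpaths part of the rule. The names PAGINATION, links and matching are illustrative and do not come from the original spider.

```python
import re

# The allow pattern from the rule in the question.
PAGINATION = re.compile(r".+/page/\d+/$")

# Links copied from the two HTML excerpts quoted in this thread.
links = [
    "http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/2/",
    "http://bitcoin.travel/listing-category/bitcoin-hotels-and-travel/page/3/",
    "http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-coffee-tea-supplies/",
    "http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-cupcakes/",
    "http://bitcoin.travel/listing-category/bitcoin-food/bitcoin-distilleries/",
]

# Keep only the URLs the pattern accepts.
matching = [url for url in links if PAGINATION.search(url)]

for url in matching:
    print(url)
```

Only the two /page/N/ links pass; none of the category links do, so a page that lists only categories gives this rule nothing to follow.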
If you want to crawl more, you need to add rules to crawl these category pages.