Using Scrapy to turn pages and get the image URLs of every page, but the callback method doesn't seem to work
> Output: "yield item" is printed very many times, with the count equal to max_page:
yield item 148762
yield item 148762
yield item 148762
yield item 148762
... (very many times, count equal to max_page)
> What I expected: "yield item" should only be called once. But actually, "yield item" is called many times.
> Question: I don't know why the code works this way.
I also have a hard time understanding your spider. Your current loop looks like this:
# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy_redis.spiders import RedisSpider

from scrapy_redis_slaver.items import MzituSlaverItem


class MzituSpider(RedisSpider):
    name = 'mzitu'
    redis_key = 'mzitu:start_urls'  # get the start urls from redis

    def __init__(self, *args, **kwargs):
        super(MzituSpider, self).__init__(*args, **kwargs)
        self.item = MzituSlaverItem()

    def parse(self, response):
        max_page = response.xpath(
            "descendant::div[@class='main']/div[@class='content']/div[@class='pagenavi']/a[last()-1]/span/text()"
        ).extract_first(default="N/A")
        max_page = int(max_page)
        name = response.xpath("./*//div[@class='main']/div[1]/h2/text()").extract_first(default="N/A")
        self.item['name'] = name
        self.item['url'] = response.url
        item_id = response.url.split('/')[-1]
        self.item['item_id'] = item_id
        # name: the pictures' title
        # url: the pictures' first page url
        # item_id: the pictures' id
        # max_page: the number of pages
        for num in range(1, max_page + 1):  # this loop turns the pages
            # page_url is the page address for each picture
            page_url = response.url + '/' + str(num)
            yield Request(page_url, callback=self.img_url, meta={"name": name,
                                                                 "item_id": item_id,
                                                                 "max_page": max_page})

    def img_url(self, response):
        # this function gets one picture's url from the response
        img_urls = response.xpath("descendant::div[@class='main-image']/descendant::img/@src").extract_first()
        # add the img_url to a set in redis
        self.server.sadd('{}:{}:images'.format(response.meta['name'], response.meta['item_id']), img_urls)
        # get the size of the img_url set from redis
        len_redis_img_list = self.server.scard('{}:{}:images'.format(response.meta['name'], response.meta['item_id']))
        if len_redis_img_list == response.meta['max_page']:
            self.item['img_urls'] = self.server.smembers('{}:{}:images'.format(response.meta['name'], response.meta['item_id']))
            print("yield item", response.meta['item_id'])
            yield self.item
            # in my mind, when len_redis_img_list equals max_page, the item will be yielded once,
            # but actually the item is yielded max_page times (very many times)
1. Go to product page
2. Find some item data
3. Find image urls
4. Chain loop through image urls to complete single item
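One likely reason the item is emitted so many times: the redis set used in img_url persists between runs, so once it already holds max_page urls from an earlier crawl, the scard check is true for every later response. On top of that, self.item is one shared instance that every response mutates. A minimal sketch of that shared-state problem (a hypothetical toy class, not your spider):

class SharedItemDemo:
    """Hypothetical minimal model of a spider that keeps one shared item."""

    def __init__(self):
        self.item = {}  # one dict shared by every "response"

    def parse(self, url):
        self.item['url'] = url  # every call overwrites the same dict
        return self.item


demo = SharedItemDemo()
first = demo.parse('http://example.com/148762')
second = demo.parse('http://example.com/148763')
# both results are the SAME object, so the first result was silently overwritten
assert first is second
assert first['url'] == 'http://example.com/148763'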
I think what you want is:
1. Go to product page
2. Find some item data
3. Split the item into `max_page` forks
3.1. Carry over data from #2 to every fork
4. Yield an item from every fork
Your spider should look something like this (a rough sketch of that forked flow; I renamed your img_url callback to parse_img):
# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy_redis.spiders import RedisSpider


class MzituSpider(RedisSpider):
    name = 'mzitu'
    redis_key = 'mzitu:start_urls'  # get the start urls from redis

    def parse(self, response):
        max_page = response.xpath(
            "descendant::div[@class='main']/div[@class='content']/div[@class='pagenavi']/a[last()-1]/span/text()"
        ).extract_first(default="N/A")
        max_page = int(max_page)
        name = response.xpath("./*//div[@class='main']/div[1]/h2/text()").extract_first(default="N/A")
        # find some item data on the product page and keep it in a plain dict
        item = {
            'name': name,
            'url': response.url,
            'item_id': response.url.split('/')[-1],
        }
        for num in range(1, max_page + 1):
            # split the item into max_page forks, one request per page,
            # and carry the data from above over to every fork through meta
            page_url = response.url + '/' + str(num)
            yield Request(page_url, callback=self.parse_img, meta={'item': item.copy()})

    def parse_img(self, response):
        # every fork completes its own copy of the item and yields it
        item = response.meta['item']
        item['img_url'] = response.xpath(
            "descendant::div[@class='main-image']/descendant::img/@src").extract_first()
        yield item
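If you still want a single record per gallery with all of its image urls collected together, that grouping is better done in an item pipeline than in the spider. A rough sketch under the assumptions above (the CollectImagesPipeline name and the per-page img_url field come from the rewritten spider, not your original code):

from collections import defaultdict


class CollectImagesPipeline:
    """Hypothetical pipeline: groups the per-page items back into one set per gallery."""

    def open_spider(self, spider):
        self.images = defaultdict(set)

    def process_item(self, item, spider):
        if item.get('img_url'):
            self.images[item['item_id']].add(item['img_url'])
        return item

    def close_spider(self, spider):
        # by the time the spider closes, every fork has been collected exactly once
        for item_id, urls in self.images.items():
            spider.logger.info('gallery %s: %d image urls', item_id, len(urls))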
Thanks, this solution is perfect. It is exactly what I wanted. When the callback method is called, does each call get a different fork, and do they all run at the same time?

Yes, Scrapy is asynchronous and handles many requests at the same time. I recommend taking a look at the architecture page, since it has some cool illustrations and explanations of how Scrapy works; there is definitely a learning curve to getting this down:
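For reference, how many of those forks run in parallel is governed by Scrapy's standard concurrency settings in settings.py (the values below are the documented defaults):

# settings.py -- standard Scrapy concurrency knobs (documented defaults shown)
CONCURRENT_REQUESTS = 16              # total requests downloaded in parallel
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap on parallel requests per domain
DOWNLOAD_DELAY = 0                    # seconds to wait between requests to the same site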