Python Scrapy is not executing the scrapy.Request callback for every link
I am trying to write an eBay spider that iterates over every product link on a listing page, visits each link, and processes the resulting page in a parse_link callback. I am scraping this listing page.

In the parse function the loop iterates over every link fine and prints each one fine, but parse_link is only ever called for one link per page. What I mean is: each page has about 50 products, I collect every product link and yield a request for each one, yet parse_link fires for only one of the ~50 links per page.

Here is the code:
import scrapy
from urllib.parse import urljoin

from ebay.items import EbayItem

# module-level counters (the original code used `global c` / `global c2`
# without ever initialising them, which raises NameError)
c = 0   # links yielded from listing pages
c2 = 0  # times parse_link actually ran


class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    # allowed_domains = ["ebay.com"]
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        global c
        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            item = EbayItem()
            link = attr.css('a.vip ::attr(href)').extract_first()
            c += 1
            print('I AM HERE', link, '\t', c)
            yield scrapy.Request(link, callback=self.parse_link, meta={'item': item})

        next_page = response.css('.gspr.next ::attr(href)').extract_first()
        print('\nI AM NEXT PAGE\n')
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

    def parse_link(self, response):
        global c2
        c2 += 1
        print('\n\n\tI am in parse_link\t', c2)
Watching each batch of ~50 links, scrapy executes parse_link only once. I use the global variables to count the number of links extracted and the number of times parse_link ran.
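Module-level globals are fragile for this kind of bookkeeping (they must be initialised somewhere, and they break if two spiders run in one process); counters kept as instance attributes behave the same way. A minimal pure-Python sketch of the pattern, with plain strings standing in for scrapy.Request objects (the class name `LinkCounter` is invented for illustration):

```python
class LinkCounter:
    """Mimics the spider's bookkeeping without module-level globals:
    the counters live on the instance instead."""

    def __init__(self):
        self.links_seen = 0      # incremented where parse() yields a request
        self.links_parsed = 0    # incremented where parse_link() runs

    def parse(self, links):
        for link in links:
            self.links_seen += 1
            yield link           # stands in for `yield scrapy.Request(...)`

    def parse_link(self, link):
        self.links_parsed += 1


counter = LinkCounter()
for link in counter.parse(['/item/%d' % i for i in range(50)]):
    counter.parse_link(link)     # in Scrapy the scheduler invokes this later

print(counter.links_seen, counter.links_parsed)  # → 50 50
```

If both counts match, every yielded request did get a callback; if `links_parsed` lags far behind, the requests are scheduled but not yet fetched, which is exactly the situation described in the answer below.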
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ebay project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ebay'
SPIDER_MODULES = ['ebay.spiders']
NEWSPIDER_MODULE = 'ebay.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ebay (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ebay.middlewares.EbaySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ebay.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ebay.pipelines.EbayPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
items.py
import scrapy
class EbayItem(scrapy.Item):
    NAME = scrapy.Field()
    MPN = scrapy.Field()
    ITEMID = scrapy.Field()
    PRICE = scrapy.Field()
    FREIGHT_1_for_quan_1 = scrapy.Field()
    FREIGHT_2_for_quan_2 = scrapy.Field()
    DATE = scrapy.Field()
    QUANTITY = scrapy.Field()
    CATAGORY = scrapy.Field()
    SUBCATAGORY = scrapy.Field()
    SUBCHILDCATAGORY = scrapy.Field()
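The spider above creates an EbayItem in parse and hands it to parse_link through `meta={'item': item}`; a scrapy.Item behaves like a dict restricted to its declared fields. A rough pure-Python stand-in to illustrate that round trip without needing Scrapy installed (`FakeItem` and its field set are invented for illustration; scrapy.Item enforces the same restriction internally):

```python
class FakeItem(dict):
    """Stand-in for EbayItem: a dict that only accepts declared fields,
    which is roughly how scrapy.Item behaves."""
    fields = {'NAME', 'PRICE', 'ITEMID'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super().__setitem__(key, value)


item = FakeItem()
item['NAME'] = 'widget'
# item['COLOR'] = 'red'   # would raise KeyError, just as scrapy.Item does

meta = {'item': item}        # what scrapy.Request(..., meta=...) carries
received = meta['item']      # what response.meta['item'] returns in parse_link
received['PRICE'] = '9.99'   # parse_link can keep filling in fields

print(sorted(received.items()))  # → [('NAME', 'widget'), ('PRICE', '9.99')]
```

The object in `response.meta` is the same instance the first callback created, so fields filled in across callbacks accumulate on one item.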
pipelines.py (I haven't touched this file; it is shown further below) and
middlewares.py (also untouched):
from scrapy import signals


class EbaySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Solution: nothing to fix, it seems to work fine.

I quickly ran your code (with only minor modifications, such as removing the globals and replacing EbayItem) and it ran well, visiting every URL you create.

Explanation / what is happening here:

I suspect your scraper is scheduled in a way that makes it look as if it is not visiting all the link URLs, but it does visit them, just later. I suspect you have set concurrent requests to 2. That is why scrapy schedules 2 of the 51 URLs for processing next. One of those two is the next-page URL, which creates another 51 requests. These new requests push the old 49 to the back of the queue... and so on, until there is no next-page link left.

If you let the scraper run long enough, you will see that every link gets visited sooner or later. Most likely the 49 "missing" requests created first will be visited last.
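If that diagnosis is right, the behaviour can be changed in settings.py: widening the concurrency window lets all 50 product requests of a page dispatch before pagination races ahead, and switching Scrapy's default LIFO queues to FIFO makes requests queued first also get fetched first (breadth-first order). A sketch using Scrapy's documented setting names; none of these appear in the question's posted settings.py, so the suspected value of 2 is an assumption:

```python
# settings.py (fragment)

# Default is 16; the answer suspects this was lowered to 2 in the
# asker's project, which makes the next-page request starve the
# 49 product requests queued behind it.
CONCURRENT_REQUESTS = 16

# Optional: crawl breadth-first instead of Scrapy's default
# depth-first order, so older requests are fetched before newer ones.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```

With FIFO ordering the ~50 parse_link callbacks of a page fire before the next listing page is fetched, which makes the counters printed by the spider line up the way the asker expected.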
Comments:

"You could also remove the creation of the next-page request to check whether all 50 links get visited. Can you print the contents of the variable next_page?"

"But it only picks up the first element of each page. Thanks for the feedback; let me check again with the next page disabled and I will let you know. Thanks again. What could be the reason that parse_link fires only once or twice per page of 50 links?"

"That is a very unusual problem, very strange indeed. Are you sure it always follows the first link on the page? Your log shows it is the last link. Since it keeps working here, we need to check the differences: 1. Is the indentation of yield scrapy.Request(link, ... exactly the same as posted here? 2. Please link to the full log of a successfully completed scraping run. 3. Please post your settings.py."

"Wow, so technically my answer was correct: the code you posted works :-) Please, please, please don't do this again: don't post code here and expect help with other code you haven't shown. Post your code completely and in detail, and when testing an answer, test the exact code you posted here, not something else. That will save us all a lot of time. Thanks."
pipelines.py
class EbayPipeline(object):
    def process_item(self, item, spider):
        return item