Python Scrapy is not executing the scrapy.Request callback for every link
I am trying to write an eBay spider that iterates over every product link on a listing page, visits each link, and processes the resulting page in a parse_link callback. I am scraping this listing page.

In the parse function the loop iterates over every link fine and prints each one fine, but parse_link is only ever called for one link per page. What I mean is: each page has about 50 products, I collect every product link and yield a request for each one, yet parse_link fires for only one of the ~50 links per page.

Here is the code:
import scrapy
from urllib.parse import urljoin

from ebay.items import EbayItem

# module-level counters (the original code used `global c` / `global c2`
# without ever initialising them, which raises NameError)
c = 0   # links yielded from listing pages
c2 = 0  # times parse_link actually ran


class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    # allowed_domains = ["ebay.com"]
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        global c
        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            item = EbayItem()
            link = attr.css('a.vip ::attr(href)').extract_first()
            c += 1
            print('I AM HERE', link, '\t', c)
            yield scrapy.Request(link, callback=self.parse_link, meta={'item': item})

        next_page = response.css('.gspr.next ::attr(href)').extract_first()
        print('\nI AM NEXT PAGE\n')
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

    def parse_link(self, response):
        global c2
        c2 += 1
        print('\n\n\tI am in parse_link\t', c2)
Watching each batch of ~50 links, scrapy executes parse_link only once. I use the global variables to count the number of links extracted and the number of times parse_link ran.
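Module-level globals are fragile for this kind of bookkeeping (they must be initialised somewhere, and they break if two spiders run in one process); counters kept as instance attributes behave the same way. A minimal pure-Python sketch of the pattern, with plain strings standing in for scrapy.Request objects (the class name `LinkCounter` is invented for illustration):

```python
class LinkCounter:
    """Mimics the spider's bookkeeping without module-level globals:
    the counters live on the instance instead."""

    def __init__(self):
        self.links_seen = 0      # incremented where parse() yields a request
        self.links_parsed = 0    # incremented where parse_link() runs

    def parse(self, links):
        for link in links:
            self.links_seen += 1
            yield link           # stands in for `yield scrapy.Request(...)`

    def parse_link(self, link):
        self.links_parsed += 1


counter = LinkCounter()
for link in counter.parse(['/item/%d' % i for i in range(50)]):
    counter.parse_link(link)     # in Scrapy the scheduler invokes this later

print(counter.links_seen, counter.links_parsed)  # → 50 50
```

If both counts match, every yielded request did get a callback; if `links_parsed` lags far behind, the requests are scheduled but not yet fetched, which is exactly the situation described in the answer below.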
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ebay project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ebay'
SPIDER_MODULES = ['ebay.spiders']
NEWSPIDER_MODULE = 'ebay.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ebay (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ebay.middlewares.EbaySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ebay.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'ebay.pipelines.EbayPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
items.py
import scrapy
class EbayItem(scrapy.Item):
    NAME = scrapy.Field()
    MPN = scrapy.Field()
    ITEMID = scrapy.Field()
    PRICE = scrapy.Field()
    FREIGHT_1_for_quan_1 = scrapy.Field()
    FREIGHT_2_for_quan_2 = scrapy.Field()
    DATE = scrapy.Field()
    QUANTITY = scrapy.Field()
    CATAGORY = scrapy.Field()
    SUBCATAGORY = scrapy.Field()
    SUBCHILDCATAGORY = scrapy.Field()
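The spider above creates an EbayItem in parse and hands it to parse_link through `meta={'item': item}`; a scrapy.Item behaves like a dict restricted to its declared fields. A rough pure-Python stand-in to illustrate that round trip without needing Scrapy installed (`FakeItem` and its field set are invented for illustration; scrapy.Item enforces the same restriction internally):

```python
class FakeItem(dict):
    """Stand-in for EbayItem: a dict that only accepts declared fields,
    which is roughly how scrapy.Item behaves."""
    fields = {'NAME', 'PRICE', 'ITEMID'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super().__setitem__(key, value)


item = FakeItem()
item['NAME'] = 'widget'
# item['COLOR'] = 'red'   # would raise KeyError, just as scrapy.Item does

meta = {'item': item}        # what scrapy.Request(..., meta=...) carries
received = meta['item']      # what response.meta['item'] returns in parse_link
received['PRICE'] = '9.99'   # parse_link can keep filling in fields

print(sorted(received.items()))  # → [('NAME', 'widget'), ('PRICE', '9.99')]
```

The object in `response.meta` is the same instance the first callback created, so fields filled in across callbacks accumulate on one item.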
pipelines.py (I haven't touched this file; it is shown further below) and
middlewares.py (also untouched):
from scrapy import signals


class EbaySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Solution: nothing to fix, it seems to work fine.

I quickly ran your code (with only minor modifications, such as removing the globals and replacing EbayItem) and it ran well, visiting every URL you create.

Explanation / what is happening here:

I suspect your scraper is scheduled in a way that makes it look as if it is not visiting all the link URLs, but it does visit them, just later. I suspect you have set concurrent requests to 2. That is why scrapy schedules 2 of the 51 URLs for processing next. One of those two is the next-page URL, which creates another 51 requests. These new requests push the old 49 to the back of the queue... and so on, until there is no next-page link left.

If you let the scraper run long enough, you will see that every link gets visited sooner or later. Most likely the 49 "missing" requests created first will be visited last.
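If that diagnosis is right, the behaviour can be changed in settings.py: widening the concurrency window lets all 50 product requests of a page dispatch before pagination races ahead, and switching Scrapy's default LIFO queues to FIFO makes requests queued first also get fetched first (breadth-first order). A sketch using Scrapy's documented setting names; none of these appear in the question's posted settings.py, so the suspected value of 2 is an assumption:

```python
# settings.py (fragment)

# Default is 16; the answer suspects this was lowered to 2 in the
# asker's project, which makes the next-page request starve the
# 49 product requests queued behind it.
CONCURRENT_REQUESTS = 16

# Optional: crawl breadth-first instead of Scrapy's default
# depth-first order, so older requests are fetched before newer ones.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```

With FIFO ordering the ~50 parse_link callbacks of a page fire before the next listing page is fetched, which makes the counters printed by the spider line up the way the asker expected.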
Comments:

"You could also remove the creation of the next-page request to check whether all 50 links get visited. Can you print the contents of the variable next_page?"

"But it only picks up the first element of each page. Thanks for the feedback; let me check again with the next page disabled and I will let you know. Thanks again. What could be the reason that parse_link fires only once or twice per page of 50 links?"

"That is a very unusual problem, very strange indeed. Are you sure it always follows the first link on the page? Your log shows it is the last link. Since it keeps working here, we need to check the differences: 1. Is the indentation of yield scrapy.Request(link, ... exactly the same as posted here? 2. Please link to the full log of a successfully completed scraping run. 3. Please post your settings.py."

"Wow, so technically my answer was correct: the code you posted works :-) Please, please, please don't do this again: don't post code here and expect help with other code you haven't shown. Post your code completely and in detail, and when testing an answer, test the exact code you posted here, not something else. That will save us all a lot of time. Thanks."
pipelines.py
class EbayPipeline(object):
    def process_item(self, item, spider):
        return item