Python Scrapy not executing the scrapy.Request callback for every link

I am trying to build an eBay spider that iterates over every product link on a page, visits each of those links, and processes the resulting page in the parse_link function.

This is the listing I am scraping (see start_urls in the code below).

In the parse function it iterates over every link fine and prints every link fine, but it only calls the parse_link callback for one link on the page.

What I mean is: every page has 50 or so products. I extract every product link, and for each of those links I want to visit it and do some processing in the parse_link function.

But for each page, parse_link is only called for one link (out of the roughly 50).

Here is the code:

# Imports and counters the snippet needs in order to run as posted (assuming the
# standard project layout, i.e. EbayItem is defined in ebay/items.py):
import scrapy
from urlparse import urljoin  # Python 2 (the print statements below are Python 2 syntax)

from ebay.items import EbayItem

c = 0   # links extracted in parse()
c2 = 0  # times parse_link() was called


class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    #allowed_domains = ["ebay.com"]
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        global c

        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            item = EbayItem()
            linkse = '.vip ::attr(href)'
            link = attr.css('a.vip ::attr(href)').extract_first()
            c+=1
            print '', 'I AM HERE', link, '\t', c
            yield scrapy.Request(link, callback=self.parse_link, meta={'item': item})
        next_page = '.gspr.next ::attr(href)'
        next_page = response.css(next_page).extract_first()
        print '\nI AM NEXT PAGE\n'
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

    def parse_link(self, response):
        global c2
        c2+=1
        print '\n\n\tIam in parselink\t', c2
As you can see, for every 50 or so links Scrapy executes parse_link only once. I use the global variables to count how many links are extracted and how many times parse_link actually runs.
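
Side note: the same two counters could also be kept in Scrapy's stats collector instead of module-level globals and print statements, so they show up in the stats summary at the end of the crawl. A minimal sketch only (the selectors and start URL are copied from the spider above; the custom/... stat names are made up for illustration):

import scrapy


class EbayCountingSpiderSketch(scrapy.Spider):
    name = "ebayspider_counting_sketch"
    start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']

    def parse(self, response):
        for attr in response.xpath('//*[@id="ListViewInner"]/li'):
            link = attr.css('a.vip ::attr(href)').extract_first()
            if link:
                # count every product link that gets scheduled
                self.crawler.stats.inc_value('custom/links_extracted')
                yield scrapy.Request(link, callback=self.parse_link)

        next_page = response.css('.gspr.next ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_link(self, response):
        # count every time the callback actually fires; both values appear in
        # the "Dumping Scrapy stats" block when the crawl finishes
        self.crawler.stats.inc_value('custom/parse_link_called')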

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for ebay project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ebay'

SPIDER_MODULES = ['ebay.spiders']
NEWSPIDER_MODULE = 'ebay.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ebay (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ebay.middlewares.EbaySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ebay.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'ebay.pipelines.EbayPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py

import scrapy
from scrapy.item import Item, Field



class EbayItem(scrapy.Item):
    NAME = scrapy.Field()
    MPN = scrapy.Field()
    ITEMID = scrapy.Field()
    PRICE = scrapy.Field()
    FREIGHT_1_for_quan_1 = scrapy.Field()
    FREIGHT_2_for_quan_2 = scrapy.Field()
    DATE = scrapy.Field()
    QUANTITY = scrapy.Field()
    CATAGORY = scrapy.Field()
    SUBCATAGORY = scrapy.Field()
    SUBCHILDCATAGORY = scrapy.Field()
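
For illustration only (this is not code from the question): the EbayItem is passed to parse_link via meta but never used there yet, so inside the spider class parse_link could eventually pick it up and fill a few of the fields defined above. The CSS selectors here are placeholders, not verified eBay markup:

    def parse_link(self, response):
        # retrieve the EbayItem created in parse() and passed along via meta
        item = response.meta['item']
        # placeholder selectors - replace with the real product-page markup
        item['NAME'] = response.css('h1#itemTitle ::text').extract_first()
        item['PRICE'] = response.css('span#prcIsum ::text').extract_first()
        item['ITEMID'] = response.url
        yield item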
pipelines.py (although I have not touched this file):

class EbayPipeline(object):
    def process_item(self, item, spider):
        return item

middlewares.py (this file is also untouched):

from scrapy import signals


class EbaySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Solution: no fix required, it seems to work fine

I quickly ran your code (with only slight modifications, such as removing the global variables and replacing EbayItem), and it ran fine and visited all the URLs you are creating.

Explanation / what is happening here:

I suspect your scraper schedules its requests in a way that makes it look as if it were not visiting all the link URLs, but it does visit them, just later.

I suspect you have concurrent requests set to 2. That is why Scrapy schedules 2 of the 51 URLs for processing next. One of those two is the next-page URL, which creates another 51 requests. These new requests push the old 49 requests towards the back of the queue... and so on, until there is no next link left.

If you run the scraper long enough, you will see that all links are visited sooner or later. Most likely, the 49 "missing" requests that were created first will be visited last.
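
If that ordering is confusing, Scrapy can be switched from its default depth-first (LIFO) scheduling to breadth-first order, so requests are processed roughly in the order they were created. These are the documented settings for that; they are a general Scrapy option, not something the posted settings.py already contains:

# settings.py - use FIFO (breadth-first) queues instead of the default
# LIFO (depth-first) scheduling
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'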


You could also remove the creation of the next-page request to check whether all 50 links get visited.
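
For that test, the end of parse() could temporarily look something like the sketch below. Commenting the block out works just as well; follow_next_page is a made-up flag, not a Scrapy setting:

        # temporarily skip scheduling the next listing page so that only the
        # ~50 product links of the first page are requested
        if getattr(self, 'follow_next_page', False):
            next_page = response.css('.gspr.next ::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)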

Could you print the contents of the variable next_page?

But it only scrapes the first element of each page. Thanks for the feedback; let me check again with the next page disabled and I will let you know, thanks again. The parse_link function only fires 1 or 2 times per page (of 50 links), what could be the reason?

That is a very unusual problem, very strange indeed. Are you sure it always follows the first link on the page? Your log shows it is the last link. Since it keeps working here, we need to check the differences: 1. Is the indentation of the yield scrapy.Request(link, ... line exactly the same as posted here? 2. Please link to the complete log of a scraping run that finished successfully. 3. Please post your settings.py.

Wow, so technically my answer was correct: the code you posted works :-) Please, please, please do not do this again: do not post code here and then expect help with other code that is not shown. Post your code completely and in detail, and when you test an answer, test the same code you posted here, not something else. That will save us all a lot of time. Thanks.