Scrapy callback returning the same results multiple times from a for loop


I am new to Scrapy and I cannot get my callback function to work properly. I manage to get all the URLs and to follow them in the callback, but when I collect the results, some of them appear several times and many are missing. What is wrong?

import scrapy

from kexcrawler.items import KexcrawlerItem


class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all']

    def parse(self, response):
        # Follow every result link on the listing page
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Extract the report title from the detail page
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item
Here are the first lines of the results:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]},
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},

I tried to reproduce your error but could not. All of the URLs are distinct. I logged every item at INFO level and suppressed everything below that, and found that every report is unique as well. I did unindent your yield call, since as posted it threw an error at me, and I defined your item class with a single field. If you copied and pasted straight from the terminal, I assume those are printed results rather than log output, which makes me think you may have multiple print calls that are invoked at different times. Try writing the output to a file somewhere and check whether there really are duplicates. To test whether the URLs are unique, I extracted the elements from your xpath into a list called elem and then:
print(len(elem))
b = set()
for e in elem:
    b.add(e)
print(len(b))

You could try creating a global list of items, then add a spider_closed function, which is called automatically when the spider closes, and run the same check on that list. A set only holds unique elements, so if the two lengths differ you really are producing duplicates.
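A minimal sketch of that idea, wiring spider_closed up through the documented signal mechanism (the DedupCheckSpider name, the collected list, and the shortened xpaths are mine, not from the question):

import scrapy
from scrapy import signals


class DedupCheckSpider(scrapy.Spider):
    name = 'dedup_check'
    allowed_domains = ['kth.diva-portal.org']
    start_urls = []  # same start URL as in the question

    collected = []  # acts as the "global" list mentioned above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_closed is invoked automatically when the crawl finishes
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        for href in response.xpath('//div[@class="searchItem borderColor"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        report = response.xpath('//span[@class="subTitle"]/text()').extract()
        self.collected.append(tuple(report))
        yield {'report': report}

    def spider_closed(self, spider):
        # a set keeps only unique entries, so any difference means duplicates
        self.logger.info('collected %d items, %d unique',
                         len(self.collected), len(set(self.collected)))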

I am pretty sure you should never override the parse method in scrapy, that is where most of its implementation lives.

@gtlambert That is not true, you have to override the parse method, since it is the entry point for scrapy. You probably mean when LinkExtractors are used: in that case you must not override parse, because it has a default implementation that is needed (or you can implement it yourself, but then you do not need the built-in extractor engine).
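For context, the built-in-extractor pattern that comment refers to looks roughly like this (a sketch, not code from the question; with CrawlSpider the default parse drives the rules, so the callback gets a different name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DivaCrawlSpider(CrawlSpider):
    name = 'diva_crawl'
    allowed_domains = ['kth.diva-portal.org']
    start_urls = []  # same start URL as in the question

    # CrawlSpider's own parse() consumes these rules, so it must not be
    # overridden; extracted links are followed and handed to parse_item.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="searchItem borderColor"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'report': response.xpath('//span[@class="subTitle"]/text()').extract()}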
@Agnes Did you look at the URLs your parse method feeds to the new requests? Scrapy does not filter on distinct results, it filters on the URLs it has already loaded. If the URLs contain some session parameters you can get the same result several times. If you want to filter the results, create a custom item exporter that marks the items it has already exported and filters them out.
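One common way to get that kind of filtering is a small item pipeline that drops anything it has already seen (a sketch with hypothetical names; the comment talks about an exporter, but a pipeline achieves the same effect):

from scrapy.exceptions import DropItem


class DuplicateReportPipeline:
    # Remembers every report already passed through and drops repeats
    # before they reach the feed exporter.
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = tuple(item['report'])
        if key in self.seen:
            raise DropItem('duplicate report: %r' % (key,))
        self.seen.add(key)
        return item

It would be enabled via the ITEM_PIPELINES setting in settings.py, for example {'kexcrawler.pipelines.DuplicateReportPipeline': 300} (the module path here is assumed).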