Scrapy callback returning the same results multiple times from a for loop


I am new to Scrapy and I cannot get my callback function to work properly. I manage to get all the URLs and to follow them in the callback, but when I collect the results, some of them appear several times and many are missing. What is wrong?

import scrapy

from kexcrawler.items import KexcrawlerItem


class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all']

    def parse(self, response):
        # Follow every result link on the listing page
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Extract the report title from the detail page
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item
Here are the first lines of the results:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]},
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},

I tried to reproduce your error but could not. All of the URLs are distinct. I logged every item at INFO level and suppressed everything below that, and found that every report is unique as well. I did unindent your yield call, since as posted it threw an error at me, and I defined your item class with a single field. If you copied and pasted straight from the terminal, I assume those are printed results rather than log output, which makes me think you may have multiple print calls that are invoked at different times. Try writing the output to a file somewhere and check whether there really are duplicates. To test whether the URLs are unique, I extracted the elements from your xpath into a list called elem and then:
print(len(elem))
b = set()
for e in elem:
    b.add(e)
print(len(b))

You could try creating a global list of items, then add a spider_closed function, which is called automatically when the spider closes, and run the same check on that list. A set only holds unique elements, so if the two lengths differ you really are producing duplicates.
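A minimal sketch of that idea, wiring spider_closed up through the documented signal mechanism (the DedupCheckSpider name, the collected list, and the shortened xpaths are mine, not from the question):

import scrapy
from scrapy import signals


class DedupCheckSpider(scrapy.Spider):
    name = 'dedup_check'
    allowed_domains = ['kth.diva-portal.org']
    start_urls = []  # same start URL as in the question

    collected = []  # acts as the "global" list mentioned above

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_closed is invoked automatically when the crawl finishes
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        for href in response.xpath('//div[@class="searchItem borderColor"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        report = response.xpath('//span[@class="subTitle"]/text()').extract()
        self.collected.append(tuple(report))
        yield {'report': report}

    def spider_closed(self, spider):
        # a set keeps only unique entries, so any difference means duplicates
        self.logger.info('collected %d items, %d unique',
                         len(self.collected), len(set(self.collected)))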

I am pretty sure you should never override the parse method in scrapy, that is where most of its implementation lives.

@gtlambert That is not true, you have to override the parse method, since it is the entry point for scrapy. You probably mean when LinkExtractors are used: in that case you must not override parse, because it has a default implementation that is needed (or you can implement it yourself, but then you do not need the built-in extractor engine).
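For context, the built-in-extractor pattern that comment refers to looks roughly like this (a sketch, not code from the question; with CrawlSpider the default parse drives the rules, so the callback gets a different name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DivaCrawlSpider(CrawlSpider):
    name = 'diva_crawl'
    allowed_domains = ['kth.diva-portal.org']
    start_urls = []  # same start URL as in the question

    # CrawlSpider's own parse() consumes these rules, so it must not be
    # overridden; extracted links are followed and handed to parse_item.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="searchItem borderColor"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'report': response.xpath('//span[@class="subTitle"]/text()').extract()}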
@Agnes Did you look at the URLs your parse method feeds to the new requests? Scrapy does not filter on distinct results, it filters on the URLs it has already loaded. If the URLs contain some session parameters you can get the same result several times. If you want to filter the results, create a custom item exporter that marks the items it has already exported and filters them out.
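One common way to get that kind of filtering is a small item pipeline that drops anything it has already seen (a sketch with hypothetical names; the comment talks about an exporter, but a pipeline achieves the same effect):

from scrapy.exceptions import DropItem


class DuplicateReportPipeline:
    # Remembers every report already passed through and drops repeats
    # before they reach the feed exporter.
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = tuple(item['report'])
        if key in self.seen:
            raise DropItem('duplicate report: %r' % (key,))
        self.seen.add(key)
        return item

It would be enabled via the ITEM_PIPELINES setting in settings.py, for example {'kexcrawler.pipelines.DuplicateReportPipeline': 300} (the module path here is assumed).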