Python: Inconsistent number of scraped items when using callbacks and following links

python scrapy

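I am scraping Yellow Pages results, and when I try to follow both the individual listing links and the pagination links I get an inconsistent number of scraped items. I believe I have two problems, but I seem to have found a workaround for the first one. Hopefully that workaround is not what causes the second.

I have no problem getting the 121 search results I expect. I am basing this on the official Scrapy tutorial: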

import scrapy


class LinksSpider(scrapy.Spider):
    name = "links"
    start_urls = [
        r"https://www.paginasamarillas.es/search/administrador-de-fincas/all-ma/zaragoza/all-is/zaragoza/all-ba/all-pu/all-nc/1?what=Administrador%20de%20fincas&where=Zaragoza",
    ]

    def parse(self, response):
        for comercial in response.css('div.col-xs-11.comercial-nombre .row a'):
            href = comercial.attrib["href"]
            if '#' not in href:
                print('href = ', href)
                yield {
                    'name': comercial.css('h2 span::text').get(),
                    'link': href,
                }

        # The "next page" anchor is the only pagination link that wraps an <i> icon.
        # Start from None so the last page (no such anchor) does not raise a NameError.
        next_page = None
        for li in response.css('ul.pagination li'):
            anchor = li.css('a')
            if anchor.css('i'):
                next_page = anchor.attrib['href']
                print('next_page = ', next_page)

        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

class FullSpider(scrapy.Spider):
    name = "full"
    start_urls = [
        r"https://www.paginasamarillas.es/search/administrador-de-fincas/all-ma/zaragoza/all-is/zaragoza/all-ba/all-pu/all-nc/1?what=Administrador%20de%20fincas&where=Zaragoza",
    ]

    def parse(self, response):
        for comercial in response.css('div.col-xs-11.comercial-nombre .row a'):
            href = comercial.attrib["href"]
            # sleep(1)
            if '#' not in href:
                print('href = ', href)
                yield response.follow(href, self.parse_comercial, meta={'link': href})

        # The "next page" anchor is the only pagination link that wraps an <i> icon.
        # Start from None so the last page (no such anchor) does not raise a NameError.
        next_page = None
        for li in response.css('ul.pagination li'):
            anchor = li.css('a')
            if anchor.css('i'):
                next_page = anchor.attrib['href']
                print('next_page = ', next_page)

        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_comercial(self, response):

        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('div.titular > h1::text'),
            'link': response.meta.get('link'),
            'sitioWeb': extract_with_css('div.botonesCta > a:not([id^="cfContacta"])::attr(href)'),
        }

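So my first problem is that I had to write the for loop over the pagination list items as a workaround to get the next page. Unlike in the tutorial, the button that takes you to the next page is marked up the same way as the shortcuts to pages 1, 2, ..., 5. The "next page" button is the only one whose anchor contains an <i> element, which is what the loop checks for.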
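The check described above can be tried offline. The following is only a sketch: it uses the standard library's ElementTree instead of Scrapy selectors, and the markup and URLs are made up for illustration (only the class name `fa icon-flecha-derecha` comes from this thread). It selects the anchor that has an `<i>` child and returns None when there is none, as on the last page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup; the real page differs.
PAGINATION = """
<ul class="pagination">
  <li><a href="/search/1">1</a></li>
  <li><a href="/search/2">2</a></li>
  <li><a href="/search/2"><i class="fa icon-flecha-derecha"></i></a></li>
</ul>
""".strip()

# Hypothetical last page: no anchor wraps an <i> icon.
LAST_PAGE = '<ul class="pagination"><li><a href="/search/5">5</a></li></ul>'


def find_next_href(markup):
    """Return the href of the first anchor that has an <i> child, or None."""
    root = ET.fromstring(markup)
    # "a[i]" matches <a> elements that have at least one <i> child.
    anchor = root.find("./li/a[i]")
    return anchor.get("href") if anchor is not None else None


next_href = find_next_href(PAGINATION)
last_href = find_next_href(LAST_PAGE)
print(next_href, last_href)
```

Returning None explicitly is what lets the spider's `if next_page is not None` test work on the final page.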

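The second spider, FullSpider, produces almost exactly the results I want, except that the number of items returned varies between runs: sometimes 91, sometimes 94, and so on, instead of 121: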

2020-02-17 18:01:50 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-17 18:01:50 [scrapy.extensions.feedexport] INFO: Stored json feed (94 items) in: full.json

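My research hints at the asynchronous nature of these requests and/or at the fact that the callback yields two different kinds of results as possible culprits. I understand both, but slowing things down (i.e. adding a sleep) or simplifying the code by asking it to yield fewer keys does not seem to improve the situation.

As I was typing this I noticed a Scrapy stat that refers to duplicate filtering, but I can't really make sense of these stats. I'll dump them here in case someone can help.

After the first spider, with the correct results: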

2020-02-17 17:37:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3167,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 134605,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 5,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 4.895525,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 17, 16, 37, 7, 343129),
 'item_scraped_count': 121,
 'log_count/DEBUG': 127,
 'log_count/INFO': 11,
 'request_depth_max': 5,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2020, 2, 17, 16, 37, 2, 447604)}

After the second spider, with the inconsistent 90-94 results:

2020-02-17 18:01:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 62095,
 'downloader/request_count': 99,
 'downloader/request_method_count/GET': 99,
 'downloader/response_bytes': 2154817,
 'downloader/response_count': 99,
 'downloader/response_status_count/200': 99,
 'dupefilter/filtered': 28,
 'elapsed_time_seconds': 9.433248,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 17, 17, 1, 50, 120636),
 'item_scraped_count': 94,
 'log_count/DEBUG': 194,
 'log_count/INFO': 11,
 'request_depth_max': 5,
 'response_received_count': 99,
 'scheduler/dequeued': 99,
 'scheduler/dequeued/memory': 99,
 'scheduler/enqueued': 99,
 'scheduler/enqueued/memory': 99,
 'start_time': datetime.datetime(2020, 2, 17, 17, 1, 40, 687388)}
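One way to read the two dumps, offered as a hypothesis rather than a confirmed diagnosis: the numbers line up with Scrapy's duplicate request filter dropping repeated detail-page URLs. The first spider yields one dict per listing, so a repeated href still counts as an item; the second spider yields one request per listing, and a request for a URL that was already scheduled is filtered. The arithmetic below uses only the values shown above.

```python
# Values copied from the two stats dumps above.
links_stats = {
    "item_scraped_count": 121,
    "dupefilter/filtered": 1,
    "downloader/request_count": 5,
}
full_stats = {
    "item_scraped_count": 94,
    "dupefilter/filtered": 28,
    "downloader/request_count": 99,
}

# Both spiders walk the same result pages.
listing_pages = links_stats["downloader/request_count"]

# Requests left over after the result pages are detail-page requests.
detail_requests = full_stats["downloader/request_count"] - listing_pages

# Items the second spider is missing compared with the first.
missing_items = links_stats["item_scraped_count"] - full_stats["item_scraped_count"]

# Extra requests dropped by the duplicate filter in the second spider.
extra_filtered = full_stats["dupefilter/filtered"] - links_stats["dupefilter/filtered"]

print(detail_requests, missing_items, extra_filtered)
```

In this run the 94 detail requests match the 94 items, and the 27 missing items match the 27 additional filtered requests. If that reading is right and every listing should produce an item, passing `dont_filter=True` to `response.follow` for the detail links would bypass the filter, at the cost of downloading some pages more than once. It would not by itself explain why the count changes from run to run.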

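Sorry this is so long, but I would appreciate any hints. Thanks.

Edit: thanks to @furas, the first problem seems to be solved. I am still struggling with the second problem, the inconsistent results.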

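To get the link to the next page, you can search for the anchor whose icon has the class "fa icon-flecha-derecha", which displays the > icon:

response.xpath('//ul[@class="pagination"]//li/a[i[@class="fa icon-flecha-derecha"]]/@href').get()  # or .getall()

If other icons may be used, you can instead check only the last element in the pagination, list_items[-1], to see whether it holds a link to the next page.

You should ask one question per page. The other question should go on a new page.

Good tip about list_items[-1]. In the end I will use your other, xpath-based suggestion, so I won't need it, but thanks. About asking two questions in one post: I thought the first problem might be affecting the second, but I see your point.

Thanks for the xpath snippet, it really cleaned up the code, and I also learned something about nested elements. The links spider is producing the 121 expected links. However, the full spider is still inconsistent. For example, why would it produce 89 results in one run, 86 in the next, then 84, then 93?

2020-02-18 15:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-18 15:04:08 [scrapy.extensions.feedexport] INFO: Stored json feed (89 items) in: full.json
2020-02-18 15:04:32 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-18 15:04:32 [scrapy.extensions.feedexport] INFO: Stored json feed (86 items) in: full.json
2020-02-18 15:04:54 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-18 15:04:54 [scrapy.extensions.feedexport] INFO: Stored json feed (84 items) in: full.json
2020-02-18 15:05:14 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-18 15:05:14 [scrapy.extensions.feedexport] INFO: Stored json feed (93 items) in: full.json
2020-02-18 15:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-18 15:06:33 [scrapy.extensions.feedexport] INFO: Stored json feed (89 items) in: full.json

I can see in the stats that it filters a different number of duplicates each time. Shouldn't the number of duplicates always be the same?

I only ran it a few times and always got the same number. But maybe the server has a problem sending all the data, or sends different data for some reason. For example, it may use two or more machines to send responses, and some of them may have old data. You would have to save all the pages several times and compare them to see whether their content differs.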
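The comparison suggested above could start from the links spider's output rather than from raw pages. This is a minimal sketch under stated assumptions: each run is a list of dicts with a `link` key, as LinksSpider yields them, and the helper names `summarize_run` and `diff_runs` as well as the sample data are hypothetical. It counts hrefs that appear more than once within a run and lists hrefs that appear in only one of two runs.

```python
from collections import Counter


def summarize_run(items):
    """Count total, unique, and repeated links in one run's items."""
    links = [item["link"] for item in items]
    counts = Counter(links)
    repeated = {link: n for link, n in counts.items() if n > 1}
    return {"total": len(links), "unique": len(counts), "repeated": repeated}


def diff_runs(run_a, run_b):
    """List links that appear in only one of the two runs."""
    a = {item["link"] for item in run_a}
    b = {item["link"] for item in run_b}
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}


# Made-up sample data standing in for two saved runs.
run_1 = [{"link": "/f/a"}, {"link": "/f/b"}, {"link": "/f/a"}]
run_2 = [{"link": "/f/a"}, {"link": "/f/c"}]

summary = summarize_run(run_1)
difference = diff_runs(run_1, run_2)
print(summary)
print(difference)
```

With real data, each run could be saved with `scrapy crawl links -o run1.json` (the file name is arbitrary) and loaded with `json.load`. If `unique` is lower than `total`, repeated hrefs exist and the duplicate filter would drop them in the full spider; if `diff_runs` returns non-empty lists, the server is serving different listings between runs.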