Python Scrapy ignores the content of the second page

python python-3.x web-scraping scrapy

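I wrote a small scraper in Python with Scrapy to parse different names from a web page. The page leads to 4 more pages through pagination. There are 46 names across all the pages, but the scraper only collects 36 of them.

The scraper is supposed to skip the content of the first landing page, but I have handled that by using the parse_start_url argument in my scraper.

However, the problem the scraper currently has is that it unexpectedly skips the content of the second page and parses all the rest, that is, the first page, the third page, the fourth page and so on. Why does this happen and how can I deal with it? Thanks in advance.

Here is the script I am trying: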

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield scrapy.Request(url=response.urljoin(new_link), callback=self.target_page)

    def target_page(self, response):
        parse_start_url = self.target_page  # I used this argument to capture the content of first page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

Because the link you specified in start_urls is actually the link to the second page. If you open it, you will see that there is no …
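To make the off-by-one concrete, here is a minimal sketch. It assumes that the site's `page` query parameter is zero-based and that the first page is served when `page` is omitted, which is what the two URLs in this thread suggest but is not confirmed anywhere else. The helper name `browse_url` and the constants are made up for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.ok.gov/browse"

# Filter parameters copied from the URLs used in the question and the answer.
FILTERS = [
    ("f[0]", "bundle_name:Dataset"),
    ("f[1]", "im_field_categories:4191"),
]


def browse_url(page_number):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    params = list(FILTERS)
    if page_number > 1:
        # Assumption: `page` is zero-based, so the second page is page=1.
        params.insert(0, ("page", str(page_number - 1)))
    # Keep the square brackets literal, as they appear in the original URLs.
    return BASE_URL + "?" + urlencode(params, safe="[]")


print(browse_url(1))  # the URL used in the answer's start_urls
print(browse_url(2))  # the URL used in the question's start_urls
```

Under that assumption, the question's start URL corresponds to `browse_url(2)`, so the spider starts on the second page rather than the first.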
This code should help you:

import scrapy
from scrapy.http import Request

class DataokspiderSpider(scrapy.Spider):

    name = 'dataoksp'
    allowed_domains = ['data.ok.gov']
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191",]

    def parse(self, response):
        # Scrape the names on the current page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

        # Follow the "next" pager link, if there is one, back into parse
        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield Request("https://data.ok.gov{}".format(next_page), callback=self.parse)

For the stats, see item_scraped_count.

It turns out the solution is very simple. I have fixed it:

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for f_link in self.start_urls:
            yield response.follow(url=f_link, callback=self.target_page)  # this is the line which fixes the issue

        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield response.follow(url=new_link, callback=self.target_page)

    def target_page(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

Now it gives me all the results.

Thanks, Andrés Pérez-Albela H., for your answer. This solution definitely works, and I had tried it as well before posting here. However, Scrapy has a built-in way to parse the data from the first page, namely parse_start_url, and I actually wanted my script to be built on that guideline. Thanks again.

@Topto I'm glad my answer helped. If it answered your question, please mark it as the accepted answer, and upvote it if you value my support.

@Topto parse_start_url is for CrawlSpider, not Spider. That is, if you need to override it, you first have to inherit from CrawlSpider. That said, your use case does not need a CrawlSpider, because you do not need rules, and the same behavior can be reproduced with the code I added to the answer (using only Spider).