Pagination: last page not scraped in Scrapy


pagination scrapy


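The code pasted below does almost what I want. However, it covers only 29 of the 30 pages and skips the last one. I would also like it to go beyond page 30, but the website has no button for that (the page does work when you manually put page=31 in the link). With DEPTH_LIMIT set to 29 everything works fine, but with 30 I get the following error in the command prompt: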

File "C:\Users\Ewald\Scrapy\OB\OB\spiders\spider_OB.py", line 23, in parse
next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
IndexError: list index out of range

I have tried various approaches, but none of them worked.

class OB_Crawler(CrawlSpider):
    name = 'OB5'
    allowed_domains = ["https://www.officielebekendmakingen.nl/"]
    start_urls = ["https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=DatumPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"]
    custom_settings = {
        'BOT_NAME': 'OB-crawler',
        'DEPTH_LIMIT': 30,
        'DOWNLOAD_DELAY': 0.1
    }

    def parse(self, response):
        s = Selector(response)
        next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[@class="volgende"]/@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts = response.selector.xpath('//div[@class = "lijst"]/ul/li')
        for post in posts:
            i = TextPostItem()
            i['title'] = ' '.join(post.xpath('a/@href').extract()).replace(';', '').replace('  ', '').replace('\r\n', '')
            i['link'] = ' '.join(post.xpath('a/text()').extract()).replace(';', '').replace('  ', '').replace('\r\n', '')
            i['info'] = ' '.join(post.xpath('a/em/text()').extract()).replace(';', '').replace('  ', '').replace('\r\n', '').replace(',', '-')
            yield i

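The index-out-of-range error is caused by an incorrect XPath: it matches nothing, so the code ends up asking for the first item of an empty list.

You need to use contains(), which tests a predicate against the class attribute and filters for the element you want. Change the "next_link = ..." line to: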

next_link = 'https://zoek.officielebekendmakingen.nl/' + s.xpath('//a[contains(@class, "volgende")]/@href').extract()[0]
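
A minimal sketch, not part of the original answer: whichever XPath is used, the last results page presumably has no "volgende" link, so .extract() may still return an empty list there. The hypothetical helper build_next_link below guards against that case. The hypothetical next_page_url shows one way to step past the last linked page by incrementing the _page query parameter, which the question says works when entered by hand. Both names are assumptions for illustration, and the code uses only the Python standard library.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

BASE_URL = "https://zoek.officielebekendmakingen.nl/"


def build_next_link(hrefs, base_url=BASE_URL):
    """Return the absolute next-page URL, or None when no link was found.

    `hrefs` is the list returned by `.extract()` on the "volgende" XPath.
    """
    if not hrefs:
        # Empty list: no next link on this page, so there is nothing to index.
        return None
    # urljoin also copes with hrefs that start with "/" or are already absolute.
    return urljoin(base_url, hrefs[0])


def next_page_url(url, param="_page"):
    """Return `url` with the integer query parameter `param` increased by 1.

    Other parameters are kept in order; special characters are re-encoded.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    bumped = [(k, str(int(v) + 1)) if k == param else (k, v) for k, v in query]
    return urlunsplit(parts._replace(query=urlencode(bumped)))
```

In parse, a request would then be yielded only when build_next_link returns a URL, for example with scrapy.Request(next_link, callback=self.parse). When it returns None, next_page_url(response.url) could supply a follow-up URL instead, given some stop condition (such as a page with no results) so the crawl does not run forever.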