Python: handling paginated sites in a different way using Scrapy

python python-3.x web-scraping pagination scrapy


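I've written a script in Python using scrapy to parse some information from a webpage. The data on that webpage is spread across paginated pages. If I use response.follow(), I can get it done. However, I would like to reuse in scrapy the logic I implemented with requests and BeautifulSoup, and I can't figure out how.
Using requests and BeautifulSoup I came up with this, which works well: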
import requests
from bs4 import BeautifulSoup

page = 0 
URL = 'http://esencjablog.pl/page/{}/'

while True:
    page += 1
    res = requests.get(URL.format(page))
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.post_more a.qbutton')
    # Stop once a page yields at most one matching link
    if len(items) <= 1:
        break

    for a in items:
        print(a.get("href"))

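Once again: my question is what the structure would look like if I wanted to use, in scrapy, the approach I've already tried with requests and BeautifulSoup, given that the last page number is unknown.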
You have to use scrapy.Request:
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/58']

    def parse(self, response):
        # Find the href of the next-page link
        # ('.post_more a.qbutton' matches the post links, not the pagination link)
        link = response.xpath('//li[contains(@class, "next_last")]/a/@href')
        if link:
            # Extract the href; extract_first() is enough because only one is needed
            href = link.extract_first()
            # Just in case the website uses relative hrefs
            url = response.urljoin(href)
            # You may change the callback if you want to use a different method
            yield scrapy.Request(url, callback=self.parse)
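
To see why the urljoin step is harmless for absolute links and necessary for relative ones, here is a minimal sketch using the standard library's urllib.parse.urljoin, which resolves an href against a base URL in much the same way response.urljoin resolves it against the page URL. The helper name resolve and the sample hrefs are hypothetical and only serve as illustration.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical hrefs, as they might appear in a next-page link
page_url = "http://esencjablog.pl/page/58/"
for href in ("/page/59/", "../59/", "http://esencjablog.pl/page/59/"):
    print(href, "->", resolve(page_url, href))
```

All three hrefs resolve to the same absolute URL, so the spider does not need to know which form the site uses.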

You can find more details in the Scrapy documentation.

In this case you cannot take advantage of parallel downloads, but since you want to emulate the same behavior in Scrapy, it can be achieved in different ways.
Approach 1 - Yielding pages using the page number
import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        # We communicate the page number using request meta.
        # This is not mandatory, as the same data can be extracted from
        # response.url, but I prefer using meta here.
        page_no = response.meta.get('page', 1) + 1

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        if items:
            # If items were found, move on to the next page
            yield Request("http://esencjablog.pl/page/{}".format(page_no), meta={"page": page_no}, callback=self.parse)
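
The comment in the spider notes that the page number could also be read from response.url instead of meta. A minimal sketch of that alternative follows; the helper names page_number and next_page_url are hypothetical, and the sketch assumes URLs of the form /page/<n>/ as used in the question.

```python
import re

# URL template taken from the question
BASE_URL = "http://esencjablog.pl/page/{}/"


def page_number(url):
    """Return the page number embedded in a /page/<n>/ URL, defaulting to 1."""
    match = re.search(r"/page/(\d+)/?$", url)
    return int(match.group(1)) if match else 1


def next_page_url(url):
    """Build the URL of the page that follows the given one."""
    return BASE_URL.format(page_number(url) + 1)


print(next_page_url("http://esencjablog.pl/page/1/"))
```

One thing to keep in mind with this variant: if the site redirects, response.url may no longer contain the page number, which is one reason to prefer meta.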

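The ideal approach, usually, is this: if you can find the last page number from the first request, you extract that number and fire all the requests at once in the first parse call. But this only works when the last page number can be determined.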
Approach 2 - Yielding the next page using the next-page link
import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        # response.follow does not accept a SelectorList, so extract the href first
        next_page = response.xpath('//li[contains(@class, "next_last")]/a/@href').extract_first()
        if next_page:
            # Follow to the next page and parse again
            yield response.follow(next_page, callback=self.parse)

This is nothing but a straightforward copy of what @Konstantin mentioned. Sorry about that, but I wanted this to be a more complete answer.
Approach 3 - Yielding all pages on the first response
import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']
    first_request = True

    def parse(self, response):
        if self.first_request:
            self.first_request = False
            # Assumption: the "last page" link is the <a> that carries (or wraps)
            # the fa-angle-double-right icon; adjust this to the site's markup
            last_page_num = response.xpath(
                '//*[contains(@class, "fa-angle-double-right")]/ancestor-or-self::a/@href'
            ).re_first(r"(\d+)/?$")

            # Yield all the pages on the first response to take advantage of parallel downloads
            # (re_first returns a string or None, hence the int() conversion)
            for page_no in range(2, int(last_page_num or 1) + 1):
                yield Request("http://esencjablog.pl/page/{}".format(page_no), callback=self.parse)

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

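The best part of this approach is that you fetch the first page, read the last page number from it, and yield all the pages at once so that the downloads happen concurrently. The previous two approaches are more sequential in nature, and you would follow them only if you don't want to put much load on the site. The ideal approach for a scraper would be Approach 3.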
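
The core of Approach 3, turning the href of the last-page link into the full list of page URLs, can be sketched without Scrapy. The function name all_page_urls and the sample href are hypothetical; the regular expression is the one used in the spider, and the sketch assumes the last-page href ends in the page number.

```python
import re


def all_page_urls(last_page_href, template="http://esencjablog.pl/page/{}/"):
    """Return URLs for pages 2..last, or an empty list if no number is found."""
    match = re.search(r"(\d+)/?$", last_page_href or "")
    if match is None:
        return []
    last_page = int(match.group(1))  # the regex yields a string, so convert it
    return [template.format(n) for n in range(2, last_page + 1)]


# Hypothetical href of a "last page" link
for url in all_page_urls("http://esencjablog.pl/page/4/"):
    print(url)
```

Page 1 is left out because it is already the start URL. Returning an empty list when the number cannot be found means the spider still scrapes the first page rather than failing.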

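As for the use of the meta object, it is explained in detail in the Scrapy documentation; the relevant part is reproduced here for reference.

Passing additional data to callback functions
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example: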
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

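In some cases you may be interested in passing arguments to those callback functions so that you can receive them later, in the second callback. You can use the Request.meta attribute for that.
Here is an example of how to pass an item using this mechanism, to populate different fields from different pages: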
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
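
Applied to Approach 1, the line page_no = response.meta.get('page', 1) + 1 relies on ordinary dictionary behavior. In the minimal sketch below a plain dict stands in for response.meta (an assumption made so the snippet runs without Scrapy), and the helper name next_page_number is hypothetical.

```python
def next_page_number(meta):
    """Mirror page_no = response.meta.get('page', 1) + 1 from Approach 1."""
    # .get('page', 1) falls back to 1 when no 'page' key was set, which is
    # the case for the response to the start URL
    return meta.get("page", 1) + 1


first_meta = {}  # start URL: nothing was stored in meta yet
page_no = next_page_number(first_meta)
print(page_no)

followup_meta = {"page": page_no}  # what the spider passes via meta={"page": page_no}
print(next_page_number(followup_meta))
```

Each request carries the number of the page it fetches, so the callback only has to add one to find the page to request next.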


You can iterate over the pages like this:
Thanks @Tarun Lalwani for your solution. This seems to be the solution I was looking for. I'll accept it in due time. By the way, could you give more details on this line: page_no = response.meta.get('page', 1) + 1? I'm not familiar with meta, but I'd like to be. Thanks.
@Topto, did you get a chance to check the updated answer?
Yes, I've already noticed it. Your answer solves everything. I'm just waiting for the bounty period so that this answer stays visible to everyone and your solution gets more attention. Thanks.