Python: How to scrape an infinite-scroll page when the XHR response isn't readable? (Python, Scrapy)


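I am trying to scrape a website; the start URL is the one in the spider below. The page implements infinite scrolling, and a simple crawl returns only 50 results. I want to get all the posts in the Referrals section.

Here is my code: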

import scrapy


class TeamBlindReferrals(scrapy.Spider):
    name = 'blindreferrals'

    # Generate the initial Request objects from the start URLs
    def start_requests(self):
        urls = ['https://www.teamblind.com/topics/Referrals']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Extract one item per post from the listing page
    def parse(self, response):
        posts = response.css('li.word-break')
        for item in posts:
            # The leading "." keeps each XPath relative to the current post;
            # a bare "//" would search the whole document for every item.
            yield {
                'title': item.xpath('.//a/@title').get(),
                'views': item.xpath(".//a[contains(@class, 'view')]/text()").get(),
                'comments': item.xpath(".//a[contains(@class, 'comment')]/text()").get(),
                'likes': item.xpath(".//a[contains(@class, 'like')]/text()").get(),
                'link': item.xpath('./a/@href').get(),
            }

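I read in another answer that sites with infinite scrolling can be scraped by inspecting the request URLs in the Network tab of the browser's developer tools. I tried inspecting the responses there and found two things:

The URL looks generic. I don't think it would give me the Referrals section, but rather the whole list of posts.

The response consists of something called a payload, which is a long string of characters (I assume it is some kind of "encryption").

How should I handle a case like this?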

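Technically, reverse-engineering the site should be possible, but here it is far from trivial: you would need to figure out how the JavaScript decodes the payload. Hint: start by looking at the references to payload in the JavaScript files.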
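As a starting point for that hint, here is a small sketch that lists where a word such as payload appears in a script you have already downloaded. The function name `find_references` and the sample JavaScript are made up for illustration; they are not taken from the site. Because bundled scripts are often minified onto one long line, it returns a short snippet around each hit rather than the whole line.

```python
import re
from typing import List, Tuple


def find_references(js_source: str, needle: str = "payload",
                    context: int = 40) -> List[Tuple[int, str]]:
    """Return (line_number, snippet) for each whole-word occurrence of needle."""
    pattern = re.compile(r"\b" + re.escape(needle) + r"\b")
    hits = []
    for line_no, line in enumerate(js_source.splitlines(), start=1):
        for match in pattern.finditer(line):
            # Keep only a window of characters around the match
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(line))
            hits.append((line_no, line[start:end].strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical stand-in for a downloaded script, not the site's real code
    sample = "function load(r){\n  var d = decode(r.payload, key);\n  render(d);\n}"
    for line_no, snippet in find_references(sample):
        print(line_no, snippet)
```

The functions that show up around those hits are the ones worth reading first when looking for the decoding step.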

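Otherwise, Splash is probably not a good fit, because every request would have to scroll down from the top until it reaches the desired page. The further you go, the more requests and time it takes to reach the target page.

So, if reverse engineering is not an option, Selenium or a similar alternative would be the only choice.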
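A minimal sketch of the Selenium route, under assumptions the thread does not confirm: Chrome with a matching driver is available, and the page grows `document.body.scrollHeight` each time it loads more posts. `scroll_until_stable` and `fetch_full_page` are hypothetical helper names. The scrolling loop takes plain callables so it can be tried without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        pause: float = 1.5,
                        max_rounds: int = 200) -> int:
    """Scroll until the page height stops growing.

    Returns the number of scrolls that made the page taller.
    """
    last_height = get_height()
    grew = 0
    for _ in range(max_rounds):
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, assume we reached the end
        last_height = new_height
        grew += 1
    return grew


def fetch_full_page(url: str) -> str:
    """Open url in Chrome, scroll to the end, and return the rendered HTML."""
    # Imported here so the helper above works without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_until_stable(
            get_height=lambda: driver.execute_script(
                "return document.body.scrollHeight"),
            scroll_to_bottom=lambda: driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"),
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be parsed with the same selectors as in the spider, for example via `scrapy.Selector(text=html)`. The fixed pause is the weak point: too short and the loop stops early, too long and a deep listing takes a very long time.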

Try this kind of pagination. It works fine for me.

def parse(self, response):
    posts = response.css('li.word-break')
    for item in posts:
        # Leading "." makes each XPath relative to the current post
        yield {
            'title': item.xpath('.//a/@title').get(),
            'views': item.xpath(".//a[contains(@class, 'view')]/text()").get(),
            'comments': item.xpath(".//a[contains(@class, 'comment')]/text()").get(),
            'likes': item.xpath(".//a[contains(@class, 'like')]/text()").get(),
            'link': item.xpath('./a/@href').get(),
        }

    # Follow the rel="next" link, if the page exposes one
    next_page = response.css('[rel="next"]::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, self.parse)

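Interesting website. When you scroll down the page there is only one HTTP request, and the response is just a payload. I'm not sure why the response to the HTTP POST request is a payload; that response is the only place the payload is mentioned. Unless I'm mistaken, the only way to do this is with Splash or Selenium. I'd be interested to see what others make of it.

Well, although it does go through multiple pages, it failed to extract the information from them. The output was just one row of data repeated for every page. (An XPath that starts with `//` inside the loop searches the whole document rather than the current item, which produces exactly this symptom; the expressions need a leading `.`.)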
