Handling load-more requests with Scrapy in Python

python web-scraping scrapy

I am trying to scrape a site with Scrapy, and my spider looks like this:

import urllib.parse

from scrapy import Request, Spider


class AngelSpider(Spider):
    name = "angel"
    allowed_domains = ["angel.co"]
    start_urls = (
        "https://angel.co/companies?locations[]=India",
    )

    def start_requests(self):
        page_size = 25
        # Headers copied from the browser's XHR request
        headers = {
            'Host': 'angel.co',
            'Origin': 'https://angel.co',
            'User-Agent': 'Scrapy spider',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept': '*/*',
            'X-Requested-With': 'XMLHttpRequest',
            'Referer': 'https://angel.co/companies?locations[]=India',
            'Accept-Language': 'en-US',
        }

        # range() steps through the offsets; a bare tuple (0, 200, page_size)
        # would only iterate over those three literal values.
        for offset in range(0, 200, page_size):
            yield Request('https://angel.co/company_filters/search_data',
                          method='POST',
                          headers=headers,
                          body=urllib.parse.urlencode(
                              {'action': 'more',
                               'filter_data[locations][]': 'India',
                               'sort': 'signal',
                               # Assumption: one page per page_size results
                               'page': offset // page_size + 1}))

    def parse(self, response):
        # Rows of the company listing
        val = response.xpath('//div[@data-_tn = "companies/trending/row" ]')
        # Company names inside each row
        company_name = response.xpath("//div[@data-_tn = 'companies/trending/row' ]//div//div//div//div[@class='name']//text()").extract()
        #company_link = val.xpath("//div//div//div[@class ='photo']//@href").extract()
        #company_tag_line = val.xpath("//div//div//div//div//div[@class='pitch u-colorGray6']//text()").extract()
        #company_from = val.xpath("//div//div//div//div//a[@name]//text()").extract()
        print(company_name)

But it does not yield any data. Is there another way to simulate the "load more articles" button so that the articles load and the scraping can continue?

The site you are trying to scrape uses JavaScript, so you have to use a browser or emulate one.

From what I can see, the site first sends a POST request, which returns JSON data containing the startup IDs, like this:

{
    "ids": [
        146538,277273,562440,67592,124939,...,460951
    ],
    "total": 18443,
    "page": 2,
    "sort": "signal",
    "new": false,
    "hexdigest": "a8ef7331cba6a01e5d2fc8f5cc3e04b69871f62f"
}

After that, the site sends a GET request, passing the values from the JSON above as URL parameters.
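A minimal sketch of how the JSON values could be turned into URL parameters, using only the standard library. The helper name `build_listing_url`, the placeholder `LISTING_URL`, and the `ids[]` parameter name are assumptions made for illustration; the real endpoint and parameter names should be copied from the GET request visible in the browser's network tab.

```python
import json
from urllib.parse import urlencode

# Hypothetical placeholder: the real listing endpoint is not given here.
LISTING_URL = "https://example.com/startups"


def build_listing_url(json_text, base_url=LISTING_URL):
    """Build a GET URL from the JSON body returned by the POST request."""
    data = json.loads(json_text)
    # Assumption: each ID is sent as a repeated "ids[]" parameter.
    params = [("ids[]", startup_id) for startup_id in data["ids"]]
    # Assumption: the remaining keys are passed through unchanged.
    for key in ("total", "page", "sort", "new", "hexdigest"):
        value = data[key]
        if isinstance(value, bool):
            # Send JSON-style booleans rather than Python's "True"/"False".
            value = "true" if value else "false"
        params.append((key, value))
    return base_url + "?" + urlencode(params)


sample = (
    '{"ids": [146538, 277273], "total": 18443, "page": 2, '
    '"sort": "signal", "new": false, '
    '"hexdigest": "a8ef7331cba6a01e5d2fc8f5cc3e04b69871f62f"}'
)
print(build_listing_url(sample))
```

Note that `urlencode` percent-encodes the brackets, so `ids[]` appears as `ids%5B%5D` in the resulting URL; servers normally decode this back to `ids[]`.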

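So the requests generated in your start_requests should be handled by a different callback, one that reads the JSON data returned in the response and builds the URL that fetches the actual list of startups as HTML.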
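In this case you don't actually need a JS engine; all the information you need is returned after a few XMLHttpRequests.

See this post: I used selenium, but I still cannot get the scraped data links into my spider. Please take a look at it.

And what exactly should those parameters based on the "ids" look like? Can you give an example?

r = requests.get("", headers={'content-type': 'application/json'}, params=urllib.urlencode({"startup_id": 3725508})) does not work.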