Python: Recursively scraping the same page with JavaScript using Scrapy and Splash




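I am scraping a website that uses JavaScript to go to the next page. I am using Splash to execute the JavaScript code on the first page, and I can get to the second page. However, I cannot reach pages 3, 4, 5, and so on. The crawl stops after just one page.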

The link I am crawling:

Code:

import scrapy
from scrapy_splash import SplashRequest
from time import sleep


class MSEDCLSpider(scrapy.Spider):
    name = "msedcl_spider"
    scope_path = 'body > table:nth-child(11) tr > td.content_area > table:nth-child(4) tr:not(:first-child)'
    ref_no_path = "td:nth-child(1) ::text"
    title_path = "td:nth-child(2) ::text"
    end_date_path = "td:nth-child(5) ::text"
    fee_path = "td:nth-child(6) ::text"
    start_urls = ["http://59.180.234.21:8788/user/viewallrecord.aspx"]

    lua_src = """function main(splash)
        local url = splash.args.url
        splash:go(url)
        splash:wait(2.0)
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4.0)
        return {
            splash:html(),
        }
        end
        """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                method='POST',
                dont_filter=True,
                args={
                    'wait': 1.0,
                    'lua_source': self.lua_src,
                },
            )


    def parse(self, response):
        print(response.status)
        scopes = response.css('#page-info').extract()[0]
        print(response.url)
        print(scopes)

I am new to both Scrapy and Splash. Please be gentle. Thanks.

I can see two problems:

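1. You are not making these requests. In start_requests only a single request is made, and its response is parsed in the self.parse method, but requests for the third and subsequent pages are never sent. To do that, you need to send some requests from the .parse method.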

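2. Once you fix (1), you will probably hit the next problem: Splash does not keep page state between requests. Think of each request as opening a new private-mode browser window and performing some actions; this is by design. The trouble with this website is that the URL does not change between pages, so you cannot just start from the third page and click "next".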

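But I think there are ways to work around this. Maybe you can get the page HTML after the click and then load it back into the browser; you could also save cookies (there is an example in the scrapy-splash README), although this website does not seem to rely on cookies for pagination.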
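To make the two points above concrete, here is a minimal sketch of one possible workaround. It is not the approach described in this answer: because each Splash request starts from a fresh page, every request reloads the first page and clicks "next" as many times as needed to reach the target page. The `clicks` argument and the helper name `splash_args_for_page` are hypothetical; the `#lnkNext` selector and the wait times come from the question's script.

```python
# Lua script for Splash's `execute` endpoint. It reads a custom `clicks`
# argument (hypothetical name) from splash.args and clicks "next" that many times.
LUA_CLICK_N_TIMES = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(2.0)
    for _ = 1, splash.args.clicks do
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4.0)
    end
    return splash:html()
end
"""


def splash_args_for_page(page_number):
    """Build the `args` dict for a request that should end up on `page_number`."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    return {
        "lua_source": LUA_CLICK_N_TIMES,
        # Page 1 needs no clicks, page 2 needs one click, and so on.
        "clicks": page_number - 1,
    }


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(page, splash_args_for_page(page)["clicks"])
```

In the spider, parse could yield a new SplashRequest with args=splash_args_for_page(n + 1) and dont_filter=True (the URL never changes), stopping once a page no longer contains #lnkNext. Note that the cost grows quadratically with the page count, since reaching page N repeats N - 1 clicks.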

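Another way is to write a script that loads all the pages, not just the next one, and returns the content of all pages to the client. Something like the Lua script at the end of this post (untested).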

To make it work you will need a larger timeout value; you may also need to start Splash with a larger --max-timeout option.
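As a rough illustration of why the default limits are likely too small, the sketch below estimates how long the all-pages script would run and builds the JSON body for Splash's execute endpoint with a matching timeout argument. The helper names, the 30-second margin, and the page count are assumptions for illustration; the per-page waits are taken from the scripts in this thread.

```python
import json


def estimate_timeout(pages, wait_per_click=4.0, initial_wait=2.0, margin=30.0):
    """Rough lower bound (seconds) for a script that clicks through `pages` pages.

    Only the explicit splash:wait() calls are counted; the margin is an
    arbitrary allowance for page loads and rendering.
    """
    if pages < 1:
        raise ValueError("pages must be >= 1")
    return initial_wait + (pages - 1) * wait_per_click + margin


def build_execute_payload(url, lua_source, pages):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    return json.dumps({
        "url": url,
        "lua_source": lua_source,
        # Splash rejects values above its --max-timeout setting.
        "timeout": estimate_timeout(pages),
    })


if __name__ == "__main__":
    print(estimate_timeout(100))  # 428.0
```

With scrapy-splash, the same timeout value can be passed in the args dict of SplashRequest; Scrapy's own download timeout may also need to be large enough to cover it.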

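There is no indentation problem in my main code. It got altered when I pasted it.

I think you are mixing spaces and tabs (at least in the pasted code). Try formatting the code in the question using only spaces (4 spaces per tab).

The problem is not the indentation. Anyway, I edited the post and fixed it. Thanks for your answer. Does the second approach have any drawbacks, such as performance or memory usage? I have some other sites to scrape that have more than 200 pages.

@REDDYPRASAD The second approach is harder to monitor and debug, and if something goes wrong you cannot get partial results and continue (unless you write the script in a way that handles that).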

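function main(splash)
    -- Load the first page and give its JavaScript time to run.
    assert(splash:go(splash.args.url))
    splash:wait(2)
    local pages = {splash:html()}
    -- Click "next" repeatedly and keep the HTML of every page (100 is a guess at the page count).
    for i = 2, 100 do
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4)
        pages[i] = splash:html()
    end
    return pages
end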
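On the client side, the table returned by this script arrives as JSON. A minimal sketch of putting the pages back in order follows; it assumes the body is either a JSON array or an object keyed by the 1-based Lua index (which shape you get may depend on the Splash version and the script), and pages_from_splash_json is a hypothetical helper name.

```python
import json


def pages_from_splash_json(body):
    """Return the page HTML strings in page order from a JSON response body."""
    data = json.loads(body)
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        # Keys such as "1", "2", "10" must be sorted numerically, not as text.
        return [data[key] for key in sorted(data, key=int)]
    raise ValueError("expected a JSON array or object of pages")


if __name__ == "__main__":
    sample = '{"1": "<html>p1</html>", "2": "<html>p2</html>", "10": "<html>p10</html>"}'
    for html in pages_from_splash_json(sample):
        print(html)
```

Each string could then be wrapped in scrapy.Selector(text=html) and queried with the CSS paths from the question.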