Crawler to get information about pages (Scrapy)

scrapy web-crawler

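How can I implement a crawler (using Scrapy) that collects all the information about a page, for example image sizes and CSS file sizes, and saves it to .txt files (page1.txt, page2.txt)?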

I tried the following for images:

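import scrapy
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = "Test"
    start_urls = ["http://www.example.com/page1.html", "http://www.example.com/page2"]

    def start_requests(self):
        # Fetch each page through Splash, waiting 5 seconds after it loads
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 5})

    def parse(self, response):
        for url_image in response.xpath('//img/@src').extract():
            # urljoin resolves relative src values against the page URL
            yield scrapy.Request(url=response.urljoin(url_image), callback=self.parse_image)

    def parse_image(self, response):
        # The file name is hard-coded: sizes from every page end up in page1.txt
        with open('page1.txt', 'a+') as f:
            f.write(str(len(response.body)) + '\n')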

This code saves the sizes of all images to page1.txt. How can I pass a parameter to parse_image()? For example, the file name that parse_image() should write to.

The Splash browser is exactly what I need.

To pass data between parse methods, you can use the meta attribute of Request:

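from scrapy import Request

def parse(self, response):
    data = {'foo': 'bar'}
    # `url` stands for whatever URL you want to request next
    yield Request(url, self.parse2, meta=data)

def parse2(self, response):
    # response.meta also holds keys that Scrapy adds itself, so read yours by name
    foo = response.meta['foo']
    # 'bar'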

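Perfect. One more question: can I return any value from parse2()? Or must parse2() return only scrapy.Request objects?

Yes, you can. Scrapy parse methods should return or yield Request objects or item/dict objects.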