Python: How to use Scrapy to check whether a website supports the http, https, and www prefixes


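I want to check whether a website works when I use http://example.com, https://example.com, or http://www.example.com. When I create a Scrapy request, it works fine. For example, my page1.com is always redirected to https://. I need to get this information as a return value. Or is there a better way to get this information using Scrapy?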

import scrapy


class myspider(scrapy.Spider):
    name = 'superspider'

    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')

        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)

The output of this spider looks like this:

https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200

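This is fine, but I would like to get this information as a return value, for example to know that the response code on http is 200, and then save it to a dictionary for later processing, or save it as JSON to a file (using items in Scrapy).

Desired output: I would like to have a dictionary named a that holds all of the information: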

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}

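Later, I would like to collect more information, so I need to store everything in one object/JSON/...

You are making an extra request at the beginning of the spider. You can handle all of these domains in the start_requests method instead: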

import scrapy


class myspider(scrapy.Spider):
    name = 'superspider'

    # same start URL as in the question
    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        # no response exists yet at this point, so work from start_urls
        for url in self.start_urls:
            # removing all possible prefixes from url
            for remove in ['https://', 'http://', 'www.']:
                url = str(url).replace(remove, '').rstrip('/')

            # Try with all possible prefixes
            for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
                yield scrapy.Request(
                    url='{}{}'.format(prefix, url),
                    callback=self.parse,
                    dont_filter=True,
                    # remember which prefix this request was built with
                    meta={'prefix': prefix},
                )

    def parse(self, response):
        yield {response.meta['prefix']: True}

Note that I am using the meta request parameter to tell the next callback method which prefix was used.
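The prefix stripping in the spiders above relies on str.replace, which would also remove a 'www.' that appears in the middle of a host name. As a minimal sketch of a stricter alternative, assuming the hypothetical helper names bare_host and candidate_urls (neither is part of Scrapy), the standard library's urllib.parse can do the splitting:

```python
from urllib.parse import urlsplit

# The four prefixes tried in the spiders above.
PREFIXES = ['http://', 'http://www.', 'https://', 'https://www.']


def bare_host(url):
    """Return the host of `url` without scheme and without a leading 'www.'."""
    # urlsplit only fills netloc when a scheme or a leading '//' is present.
    parts = urlsplit(url if '://' in url else '//' + url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


def candidate_urls(url):
    """Build one URL per prefix for the host found in `url`."""
    host = bare_host(url)
    return ['{}{}'.format(prefix, host) for prefix in PREFIXES]
```

Inside start_requests, looping over zip(PREFIXES, candidate_urls(url)) would give both the prefix for meta and the URL for the request. Paths and query strings are dropped on purpose here, since only the host is being probed.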

Instead of using the meta approach pointed out by eLRuLL, you can also inspect request.url:

scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'

In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
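The shell session above shows the requested URL and the final URL differing only in their prefix. A minimal sketch of turning that comparison into data, assuming the hypothetical helper names prefix_of and describe (plain standard-library code, not a Scrapy API):

```python
from urllib.parse import urlsplit


def prefix_of(url):
    """Return the prefix ('http://', 'https://www.', ...) that `url` starts with."""
    parts = urlsplit(url)
    prefix = parts.scheme + '://'
    if parts.netloc.startswith('www.'):
        prefix += 'www.'
    return prefix


def describe(requested_url, final_url):
    """Compare the prefix that was requested with the one that was served."""
    requested = prefix_of(requested_url)
    final = prefix_of(final_url)
    return {
        'requested': requested,
        'final': final,
        'redirected': requested != final,
    }
```

With the URLs from the shell session, describe('http://stackoverflow.com', 'https://stackoverflow.com/') reports a redirect from 'http://' to 'https://'. Inside a spider callback the originally requested URL may not be available as response.request.url after a redirect; Scrapy's redirect middleware records the earlier URLs in response.meta['redirect_urls'], so the first entry there is probably the safer source.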

To store the values of the different runs in one dict/JSON, you can use an additional item pipeline. The idea is something like this:

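class WriteAllRequests(object):
    def __init__(self):
        # maps each url to the {prefix: status} results collected so far
        self.urldic = {}

    def process_item(self, item, spider):
        # items are expected to carry 'url', 'urlprefix' and 'urlstatus'
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # think this can be passed to a standard pipeline with a higher number
            # writedata is a placeholder: define it to store the finished dict
            writedata(self.urldic[item['url']])

            del self.urldic[item['url']]
        return item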

You additionally have to activate the pipeline through the ITEM_PIPELINES setting.
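The collecting step of the pipeline can be tried out without Scrapy. The following is a minimal sketch, assuming a hypothetical PrefixCollector class and assuming that "supported" means a 200 response; both are illustrative choices, not something the thread prescribes:

```python
import json


class PrefixCollector:
    """Collect {prefix: supported} results per domain until all have arrived."""

    def __init__(self, expected=4):
        self.expected = expected  # number of prefixes tried per domain
        self.pending = {}         # domain -> partial {prefix: bool}
        self.complete = {}        # domain -> finished {prefix: bool}

    def add(self, domain, prefix, status):
        """Record one result; move the domain to `complete` once it is full."""
        results = self.pending.setdefault(domain, {})
        results[prefix] = (status == 200)  # assumption: only 200 counts as supported
        if len(results) == self.expected:
            self.complete[domain] = self.pending.pop(domain)

    def to_json(self):
        """Serialise all finished domains as a JSON string."""
        return json.dumps(self.complete, sort_keys=True)
```

A pipeline's process_item could call add() for every item and write to_json() to a file when the spider closes. One caveat: a request that fails outright (for example a DNS error) never reaches the callback, so a domain can stay in pending forever unless failures are reported too, for instance from a request errback.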

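Could you update the question with the desired output?

I inserted the desired output. Thanks. So how can I store {response.meta['prefix']: True} in some variable, say a variable myprefixes in the start_requests function?

Where is your response.meta when the parse function is called?

As I mentioned before, I know how to check and print it. The problem is how to get a return value out of the function and store it, for example as a dictionary. When I use a callback function, the code is executed, but it does not return anything. One option is to store the result in redis and read and save it once I am back from the callback. That works, but in my opinion it is not the best way to solve this.

OK, so your problem is that you want to store the values of 4 different runs in one dict/JSON... so I changed my answer.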