Python: How to send a cookie with Scrapy CrawlSpider requests?

python cookies web-scraping scrapy

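I am trying to build a Reddit scraper with Python's Scrapy framework.
I used a CrawlSpider to crawl Reddit and its subreddits. However, when I reach pages with adult content, the site asks for the cookie over18=1.
So I have been trying to send that cookie with every request the spider makes, but it is not working.
Here is my spider code. As you can see, I tried to use the start_requests() method to add the cookie to every spider request.
Can anyone tell me how to do this, or what I am doing wrong?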

from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem
from scrapy.http import Request, FormRequest

class MySpider(CrawlSpider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    rules = (
        Rule(LinkExtractor(
            allow=['/r/nsfw/\?count=\d*&after=\w*']),
            callback='parse_item',
            follow=True),
    )

    def start_requests(self):
        for i,url in enumerate(self.start_urls):
            print(url)
            yield Request(url,cookies={'over18':'1'},callback=self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = RedditItem()
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item

OK. Try something like this:

def start_requests(self):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
    for i,url in enumerate(self.start_urls):
        yield Request(url,cookies={'over18':'1'}, callback=self.parse_item, headers=headers)

It was the User-Agent that was getting you blocked.
Edit:
I don't know what is wrong with CrawlSpider, but a plain Spider works anyway:

#!/usr/bin/env python
# encoding: utf-8
import scrapy


class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
        wrapper for scrapy.request
        """
        request = scrapy.Request(url=url, callback=callback)
        request.cookies['over18'] = 1
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
            'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
        return request

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield self.request(url, self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = {}
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
        url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
        if url:
            yield self.request(url, self.parse_item)
        # you may consider scrapy.pipelines.images.ImagesPipeline :D

You can also send it through the headers:

scrapy.Request(url=url, callback=callback, headers={'Cookie':my_cookie})
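
The one-liner above leaves my_cookie undefined. A minimal sketch of building it, assuming it is meant to be a plain Cookie header value made of name=value pairs joined by "; " (the helper name build_cookie_header is hypothetical and not part of Scrapy):

```python
def build_cookie_header(cookies):
    """Join name/value pairs into a single Cookie request-header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# The cookie Reddit asks for in the question.
my_cookie = build_cookie_header({"over18": "1"})
```

Scrapy's cookies middleware manages the Cookie header itself, so depending on the Scrapy version and settings a hand-set header may be merged or replaced. The cookies= argument shown in the other answers is the documented route, so it is probably the safer choice.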
1. Using a dict:

request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'})
2. Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com", cookies=[{'name': 'currency', 'value': 'USD', 'domain': 'example.com', 'path': '/currency'}])

You can use the process_request parameter of a Rule, for example:

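# Both pieces go inside the CrawlSpider subclass.
rules = (
    Rule(LinkExtractor(
        allow=['/r/nsfw/\?count=\d*&after=\w*']),
        callback='parse_item',
        process_request='ammend_req_header',
        follow=True),
)

# Scrapy 2.0 and later also pass the originating response as a second
# argument, so it is accepted here with a default.
def ammend_req_header(self, request, response=None):
    request.cookies['over18']=1
    return request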

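Are the cookies in request.cookies?

@esfy No, I don't think so. I have specified the cookies in Request(url, cookies={'over18':'1'}, callback=self.parse_item)

It worked. But before I accept your answer: I think the cookie only applies to the first request and not to the paginated requests, i.e. it only works for start_urls but not for the paginated URLs we get from the LinkExtractor. In fact, the problem is that if I use the start_requests() method, the crawl stops after one page. When I remove it, it starts crawling the pagination. No idea why!

Oh, cookies set on the client side do not persist between requests the way cookies sent by the server do. Perhaps cookiejar will do.

It works. Great, thanks for the help. I didn't know you had posted an edit. Funny, I didn't get any notification. Anyway, we'll have to look into what the problem with CrawlSpider is.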