Scrapy-Splash script can't find CSS selector
I'm trying to write a Scrapy-Splash script that gets the links to the food items on the following page. When you first visit it, the site makes you choose a region; I believe I've handled that correctly by setting the cookies dict in the code below. I'm trying to get the links to all of the food items in the carousel. I'm using Splash because the carousel is generated by JavaScript, so an ordinary request parsed with Beautiful Soup doesn't show it in the HTML. My problem is that no data ends up in my `item` dict:
import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                cookies={'currentRegion': 'CA-BC'},
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        item = {}
        item['urls'] = []
        # note: the ::attr() pseudo-element attaches directly to the element,
        # i.e. "a::attr(href)", not "a > ::attr(href)"
        itemList = response.css('div.product-name-wrapper > a::attr(href)').extract()
        for link in itemList:
            item['urls'].append(link)
        yield item
I think my cookie isn't being set correctly, so the site keeps taking me to the page where I have to choose a region.
By the way, I'm also running Splash in a Docker container; if I go to localhost in a browser, it shows the Splash start page.
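For reference, the usual way to start that container (a sketch, assuming the standard scrapinghub/splash image) is:

```shell
# Pull and run the official Splash image, exposing its HTTP API on port 8050
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
```

With the container running, http://localhost:8050 should show the Splash start page, which matches what is described above.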
Here is the output I get when I run the spider:
<GET https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000 via http://localhost:8050/render.html> (referer: None)
2017-07-04 16:44:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000>
{'urls': []}
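As a sanity check that the extraction logic itself is sound once the carousel HTML is actually rendered, here is a stdlib-only sketch of what the selector `div.product-name-wrapper > a::attr(href)` pulls out. The sample HTML and its link paths are invented for illustration:

```python
from html.parser import HTMLParser

# Invented sample of what a rendered carousel might look like
SAMPLE = (
    '<div class="product-name-wrapper">'
    '<a href="/food/meat/item-1">Item 1</a></div>'
    '<div class="product-name-wrapper">'
    '<a href="/food/meat/item-2">Item 2</a></div>'
)


class HrefCollector(HTMLParser):
    """Collect href attributes of <a> tags inside
    <div class="product-name-wrapper"> -- the same elements the CSS
    selector div.product-name-wrapper > a::attr(href) would match."""

    def __init__(self):
        super().__init__()
        self.in_wrapper = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'product-name-wrapper':
            self.in_wrapper = True
        elif tag == 'a' and self.in_wrapper and 'href' in attrs:
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_wrapper = False


parser = HrefCollector()
parser.feed(SAMPLE)
print(parser.hrefs)  # ['/food/meat/item-1', '/food/meat/item-2']
```

An empty result against the live page therefore points at the rendered HTML (the region-selection page being served instead of the product page), not at the extraction step.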
But that was entered as a script in the browser. How do I apply it in a Python script? Is there a different way to add cookies in Python?

If you have a Lua script that works for you, you can run it with the /execute endpoint:
yield SplashRequest(url, endpoint='execute', args={'lua_source': my_script})
scrapy-splash also lets you set up transparent cookie handling, so that cookies persist across Splash requests just as they do for regular Scrapy requests:
script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""

class MySpider(scrapy.Spider):

    # def my_parse...
    # ...
    yield SplashRequest(url, self.parse_result,
        endpoint='execute',
        cache_args=['lua_source'],
        args={'lua_source': script},
    )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
See the scrapy-splash README.
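For the transparent cookie handling described above to work, the scrapy-splash middlewares also have to be enabled in settings.py. A minimal configuration along the lines of the README (the priority numbers are the README's recommended values):

```python
# settings.py -- point Scrapy at the Splash instance and enable the
# scrapy-splash middlewares, including SplashCookiesMiddleware, which
# carries cookies across Splash requests
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Without SplashCookiesMiddleware enabled, the cookies returned by splash:get_cookies() in the Lua script are not fed back into subsequent requests.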