Python 3.x CrawlSpider doesn't call self.parse()
I have a CrawlSpider script that uses Splash to log in to a JavaScript page. However, after a successful login, the inherited self.parse() function never seems to be called. After crawling the first page, the spider closes.

I thought a CrawlSpider would call the self.parse method automatically once start_requests yields a response. But even with an explicit callback, self.parse does not appear to be called.

What am I doing wrong?

The script:
#!/usr/bin/env python3
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from harvest.items import HarvestItem
from scrapy_splash import SplashRequest


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['test.secure.force.com', 'login.salesforce.com']
    login_url = 'https://test.secure.force.com/jSites_Home'

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="nav"]/ul/li/a[@title="Assignments"]')),
        Rule(LinkExtractor(restrict_xpaths='//*/table/tbody/tr[2]/td[1]/a'), callback='parse_item'),
    )

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(10))
            splash:set_viewport_full()
            local search_input = splash:select('input[name=username]')
            search_input:send_text("someuser")
            local search_input = splash:select('input[name=pw]')
            search_input:send_text("p4ssw0rd")
            assert(splash:wait(5))
            local submit_button = splash:select('input[class=btn]')
            submit_button:click()
            assert(splash:wait(10))
            return {html = splash:html(),}
        end
        """
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko'
                                 ') Chrome/55.0.2883.95 Safari/537.36'}
        yield SplashRequest(url=self.login_url,
                            callback=self.parse,
                            endpoint='execute',
                            args={
                                'lua_source': script,
                                'wait': 5
                            },
                            splash_headers=headers,
                            headers=headers)

    def parse_item(self, response):
        items = HarvestItem()
        items['start'] = response.xpath(
            '(//*/table[@class="detailList"])[3]/tbody/tr[1]/td[1]/span/text()').extract()
        return items
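One mechanism that may be relevant here: in Scrapy's CrawlSpider, the link-following step (`_requests_to_follow`) returns early for any response that is not an `HtmlResponse`, and scrapy-splash's `execute` endpoint returning a Lua table produces a JSON-based response type that does not subclass `HtmlResponse` — which would explain rules never firing after the first page. The sketch below is a standalone, simplified illustration of that type-gated dispatch, not Scrapy's actual code; the class names merely mirror Scrapy's:

```python
# Simplified sketch of CrawlSpider's rule dispatch: link extraction is
# gated on the response type. These classes only mimic Scrapy's names.

class TextResponse:
    def __init__(self, body):
        self.body = body

class HtmlResponse(TextResponse):
    pass

class SplashJsonResponse(TextResponse):
    """Stand-in for scrapy-splash's execute-endpoint response type."""
    pass

def requests_to_follow(response):
    # CrawlSpider only extracts rule links from HtmlResponse instances;
    # any other response type yields no follow-up requests.
    if not isinstance(response, HtmlResponse):
        return []
    return ["<requests built from rule-extracted links>"]

# An HTML response produces follow-up requests; a JSON-style response
# produces none, so the crawl stops after the first page.
print(len(requests_to_follow(HtmlResponse("<html/>"))))       # 1
print(len(requests_to_follow(SplashJsonResponse("{...}"))))   # 0
```

If this is the cause, the usual workaround discussed for scrapy-splash is to avoid routing the rule-matching pages through the `execute` endpoint, or to override the link-following step to accept the Splash response type.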