Python: crawling a page that requires a click, using scrapyjs with Splash

I'm trying to get the URL from a page that uses JavaScript, like:
<span onclick="go1()">click here </span>
<script>
function go1(){
    window.location = "../innerpages/" + myname + ".php";
}
</script>
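For reference, the relative path that go1() assigns to window.location resolves against the current page's URL, just as urljoin does in Python. A small sketch, using a hypothetical base URL and name:

```python
from urllib.parse import urljoin

# Hypothetical page URL; go1() sets window.location relative to it.
base = "http://example.com/pages/index.php"
myname = "alice"  # hypothetical value of the page's "myname" variable
target = urljoin(base, "../innerpages/" + myname + ".php")
print(target)  # http://example.com/innerpages/alice.php
```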
If I write

'js_source': 'document.title="hello world"'

it works. So it seems I can manipulate text inside the page, but I can't get the URL that go1() navigates to. What should I do to get the URL from go1()? Thanks.

You can use the Splash `execute` endpoint with a Lua script that clicks the element and returns the rendered HTML:
import json

import scrapy


class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))
            assert(splash:runjs('document.getElementsByTagName("span")[0].click()'))
            assert(splash:wait(1))
            -- return result as a JSON object
            return {
                html = splash:html()
            }
        end
        """
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse_result(self, response):
        # fetch the base URL, because response.url is the Splash endpoint
        baseurl = response.meta["_splash_processed"]["args"]["url"]
        # decode the JSON object returned by the Lua script
        splash_json = json.loads(response.text)
        # build a new selector from the "html" key of that object
        selector = scrapy.Selector(text=splash_json["html"], type="html")
        ...
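Since the question specifically asks for the URL that go1() navigates to, one option (a sketch, not tested against a live Splash instance) is to have the Lua script also return splash:url(), which is the page's URL after the click and the JavaScript redirect have run. The helper and the fake response body below are hypothetical, for illustration only:

```python
import json

# Extended Lua script: also return splash:url() so the spider can read
# the URL reached after the simulated click.
script = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(1))
    assert(splash:runjs('document.getElementsByTagName("span")[0].click()'))
    assert(splash:wait(1))
    return {
        html = splash:html(),
        url = splash:url()  -- final URL after go1() ran
    }
end
"""

def extract_final_url(body):
    """Pull the post-click URL out of the JSON body Splash returns."""
    return json.loads(body)["url"]

# Simulated Splash response body, for illustration:
fake_body = '{"html": "<html></html>", "url": "http://example.com/innerpages/alice.php"}'
print(extract_final_url(fake_body))  # http://example.com/innerpages/alice.php
```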