Python Scrapy splash spider does not follow links to fetch new pages
I am scraping data from a page that uses Javascript links to reach new pages. I am using Scrapy + splash to fetch this data, but for some reason the links are not being followed. Here is the code for my spider:
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    local javascript = args.javascript
    assert(splash:runjs(javascript))
    splash:wait(0.5)
    return {
        html = splash:html()
    }
end
"""
page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=0&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"
class MySpider(scrapy.Spider):
    name = "foo_crawler"
    download_delay = 5.0

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        #'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'
    }

    def start_requests(self):
        yield SplashRequest(url=page_url,
                            callback=self.parse)

    # Parses first page of ticker, and processes all maturities
    def parse(self, response):
        try:
            self.extract_data_from_page(response)
            href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href')
            print("href: {0}".format(href))
            if href:
                javascript = href.extract_first().split(':')[1].strip()
                yield SplashRequest(response.url, self.parse,
                                    cookies={'store_language': 'en'},
                                    endpoint='execute',
                                    args={'lua_source': script, 'javascript': javascript})
        except Exception as err:
            print("The following error occurred: {0}".format(err))

    def extract_data_from_page(self, response):
        url = response.url
        page_num = url.split('page=')[1].split('&')[0]
        print("extract_data_from_page() called on page: {0}.".format(url))
        filename = "page_{0}.html".format(page_num)
        with open(filename, 'w') as f:
            f.write(response.text)

    def handle_error(self, failure):
        print("Error: {0}".format(failure))
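The page-number extraction in extract_data_from_page relies on simple string splitting. A standalone sketch, run against a hypothetical URL that follows the same pattern as page_url, shows what it yields:

```python
# Standalone sketch of the page-number extraction used above,
# applied to a hypothetical URL with the same query-string layout.
url = ("https://www.londonstockexchange.com/exchange/prices-and-markets/"
       "stocks/exchange-insight/trade-data.html?page=3&pageOffBook=0")
page_num = url.split('page=')[1].split('&')[0]
print(page_num)  # -> 3
```

Note that this only works while "page=" appears exactly once before any other matching substring; a urllib.parse.parse_qs call would be more robust.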
Only the first page is fetched; I cannot get the subsequent pages by "clicking" the links at the bottom of the page.

How can I fix this so that the pages linked at the bottom are clicked through?

Your code looks fine. The only problem is that the generated requests have the same URL, so they are being ignored by the duplicate filter. Just uncomment DUPEFILTER_CLASS and try again:
custom_settings = {
    ...
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}
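Why the follow-up requests were dropped can be illustrated with a toy model of what Scrapy's default dupe filter does (a sketch only; the real RFPDupeFilter fingerprints the whole request, not just the URL string):

```python
# Toy model of duplicate filtering: a request whose fingerprint
# (here simply its URL) has been seen before is silently dropped.
seen = set()
scheduled = []
for url in ["https://example.com/trade-data?page=0",
            "https://example.com/trade-data?page=0"]:  # same URL twice
    if url in seen:
        continue  # this is what happened to the second SplashRequest
    seen.add(url)
    scheduled.append(url)
print(len(scheduled))  # -> 1
```

Disabling the filter with BaseDupeFilter removes this check globally; marking individual requests with dont_filter=True is a more targeted alternative.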
Edit: to page through the data without running the javascript, you can do the following:
page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=%s&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"

    page_number_regex = re.compile(r"'frmRow',(\d+),")
    ...
    def start_requests(self):
        yield SplashRequest(url=page_url % 0,
                            callback=self.parse)
    ...
            if href:
                javascript = href.extract_first().split(':')[1].strip()
                matched = re.search(self.page_number_regex, javascript)
                if matched:
                    yield SplashRequest(page_url % matched.group(1), self.parse,
                                        cookies={'store_language': 'en'},
                                        endpoint='execute',
                                        args={'lua_source': script, 'javascript': javascript})
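As a quick check, here is the regex applied to a hypothetical javascript payload of the kind the "Next" link's href carries (the exact shape of the paging call is an assumption):

```python
import re

page_number_regex = re.compile(r"'frmRow',(\d+),")
# Hypothetical javascript string extracted from the "Next" link's href.
javascript = "paging('frmRow',2,1);"
matched = re.search(page_number_regex, javascript)
print(matched.group(1))  # -> 2
```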
Still, I am hoping for a solution that uses the javascript.

You can use the page query string variable. It starts at 0, so the first page is page=0. You can see the total number of pages by looking at:
<div class="paging">
<p class="floatsx"> Page 1 of 157 </p>
</div>
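A sketch of parsing the total page count out of that element's text:

```python
import re

# Text content of the <p class="floatsx"> element shown above.
paging_text = " Page 1 of 157 "
m = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", paging_text)
current, total = int(m.group(1)), int(m.group(2))
# The page query parameter is zero-based, so valid values are 0..total-1.
print(total)  # -> 157
```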
That way you know to call pages 0 through 156.

Just to be sure, is the printed href correct? @matthieu.cham Yes (though it is javascript). @malberts Thanks. I modified the snippet to extract valid Javascript. Unless I am mistaken, the Lua script should not open a URL but execute the passed-in Javascript string, and the Lua snippet is executing that string. Yes, the generated requests have the same URL, which differs from the browser's behaviour. It seems the javascript is not being run. The problem is (as you correctly found) that the URL does not change, whereas when the link is "clicked" in a browser the URL does change. The challenge is how to replicate this "click" behaviour with Scrapy + Splash. Ok sorry, I was slow to understand the issue. I do not have a solution for making scrapy run the javascript atm, but if you want to page through all the data, just extract the next page number from the javascript snippet and pass it into the generated request. See my answer below.