Python Splash + Scrapy: script embedding, Scrapy extract() not working

My problem is that I cannot embed my Splash script in my Scrapy crawler. Splash itself is working: I managed to render the content I wanted in the browser, so I copied the script and tried to parse the HTML with Scrapy. Here is my spider:
import scrapy
from scrapy_splash import SplashRequest


class Ntest(scrapy.Spider):
    name = "test"
    script = """
    function main(splash)
        splash.private_mode_enabled = false
        splash.html5_media_enabled = true
        assert(splash:go(args.url))
        assert(splash:wait(0.3))
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
        }
    end
    """

    def start_request(self, response):
        yield SplashRequest(
            url='https://www.mp4upload.com/embed-yfani9opk91x.html',
            endpoint='render.html',
            args={'lua_source': self.script},
            callback=self.parse,
        )

    def parse(self, response):
        r = response.css('body').extract()
Here is my settings.py:
SPLASH_URL = 'http://localhost:8050/'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
When I run scrapy runspider .\main.py I get:
2018-06-25 14:17:38 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-25 14:17:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.2.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-06-25 14:17:39 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_LOADER_WARN_ONLY': True}
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-25 14:17:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider opened
2018-06-25 14:17:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-25 14:17:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-25 14:17:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 112025),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2018, 6, 25, 12, 17, 39, 104037)}
2018-06-25 14:17:39 [scrapy.core.engine] INFO: Spider closed (finished)
I want to extract the body from the HTML. Please help.

From the logs, it is clear that no requests were performed at all. If the code is indented as it appears in the post, start_request() and parse() are defined outside the spider class. Even if they are not, the correct method name is start_requests() (and it takes no response argument, since it produces the very first requests of the crawl).
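To see why a misspelled name fails silently rather than raising an error, note that Scrapy's engine looks up a method named exactly start_requests on the spider to seed the crawl; anything else is just an unused method. A minimal sketch of that lookup, with a toy MiniEngine class standing in for Scrapy's real engine (no Scrapy required), shows the behavior that matches the log above: zero requests, then "Closing spider (finished)".

```python
class MiniEngine:
    """Toy stand-in for Scrapy's engine: seeds the crawl by looking up
    a method named exactly 'start_requests' on the spider."""

    def open_spider(self, spider):
        start = getattr(spider, "start_requests", None)
        if start is None:
            # Nothing to schedule: the spider closes immediately with
            # "Crawled 0 pages", as seen in the question's log output.
            return []
        return list(start())


class BrokenSpider:
    def start_request(self):     # wrong name: the engine never finds it
        yield "SplashRequest"


class FixedSpider:
    def start_requests(self):    # correct name: the engine calls it
        yield "SplashRequest"


engine = MiniEngine()
print(engine.open_spider(BrokenSpider()))  # prints []
print(engine.open_spider(FixedSpider()))   # prints ['SplashRequest']
```

So renaming start_request to start_requests (and dropping the response parameter) is enough to make the spider actually issue its SplashRequest; parse() will then receive the rendered response.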