Docker Scrapy Splash not working


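I am trying to scrape a website that relies on JavaScript, using the scrapy-splash plugin.

I installed Splash with Docker on Ubuntu 16.04, which is the system I am using: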

$ sudo docker pull scrapinghub/splash
$ sudo docker run -p 8050:8050 scrapinghub/splash
With that I have a running Splash Docker container and everything looks fine, but

Splash prints these errors while handling requests from Scrapy:

2017-07-20 03:03:23+0000 [-] Log opened.
2017-07-20 03:03:23.870491 [-] Splash version: 3.0
2017-07-20 03:03:24.007457 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
2017-07-20 03:03:24.007614 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
2017-07-20 03:03:24.007746 [-] Open files limit: 65536
2017-07-20 03:03:24.007879 [-] Can't bump open files limit
2017-07-20 03:03:24.291391 [-] Xvfb is started: ['Xvfb', ':911054901', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
2017-07-20 03:03:43.425858 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2017-07-20 03:04:09.534239 [-] verbosity=1
2017-07-20 03:04:09.534387 [-] slots=50
2017-07-20 03:04:09.534499 [-] argument_cache_max_entries=500
2017-07-20 03:04:09.534974 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
2017-07-20 03:04:09.535774 [-] Site starting on 8050
2017-07-20 03:04:09.535904 [-] Starting factory <twisted.web.server.Site object at 0x7f0e78e18d30>
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
**See the manual page for dbus-uuidgen to correct this issue.
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method**
Also, the website is served over HTTPS.
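One way to separate a Splash problem from a Scrapy problem is to ask Splash to render a single page directly, without Scrapy in between. The sketch below only builds the request URL for the render.html endpoint that appears in the Scrapy log; the helper name build_render_url and the wait and timeout values are assumptions chosen for illustration, not taken from the original post.

```python
from urllib.parse import urlencode


def build_render_url(target_url, splash_base="http://localhost:8050",
                     wait=0.5, timeout=60):
    """Return a render.html URL that asks Splash to render target_url.

    wait and timeout are Splash render.html arguments, both in seconds.
    """
    query = urlencode({"url": target_url, "wait": wait, "timeout": timeout})
    return f"{splash_base}/render.html?{query}"


# Print a URL that can be opened in a browser or fetched with curl
# while the Splash container is running.
print(build_render_url(
    "https://www.raymourflanigan.com/willoughby-sofa-200326456.aspx"))
```

If a single request like this returns rendered HTML while the crawl still times out, load on the Splash instance is a more likely cause than the SSL warnings in the log.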

In Scrapy I import

from scrapy_splash import SplashRequest
and I make requests like this

yield SplashRequest(link, meta={'item': item}, callback=self.parse_data)
instead of

yield scrapy.Request(link, meta={'item': item}, callback=self.parse_data)
But Splash still does not process these requests.

What am I doing wrong? Is there a problem with Ubuntu?
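The SplashRequest shown above passes no rendering arguments, so Splash falls back to its defaults. A possible variation, sketched here under assumptions, is to pass explicit args: wait, timeout and images are real render.html arguments, but the helper name splash_args and the values are illustrative only. Splash rejects a timeout larger than its --max-timeout startup option.

```python
def splash_args(wait=0.5, timeout=60, images=0):
    """Build the args dict for a SplashRequest (illustrative values).

    wait:    seconds to wait after the page loads, so scripts can run
    timeout: seconds Splash may spend rendering before it gives up
    images:  0 skips image downloads, which usually shortens rendering
    """
    return {"wait": wait, "timeout": timeout, "images": images}


# In the spider (needs scrapy-splash installed), the request would become:
#   yield SplashRequest(link, callback=self.parse_data,
#                       args=splash_args(), meta={'item': item})
print(splash_args())
```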

Scrapy debug output (the original log is reproduced at the end of this post):

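Did you try opening the Splash console in your browser (i.e. point your browser to port 8050 on the host where Splash is installed)? Also, can you share the Scrapy logs from the spider run?

Yes, I tried Splash by visiting it, and I will update the question now.

I also see qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method in my Splash logs, but I am able to render pages from raymourflanigan.com. What is your concurrency level? Maybe your Splash instance cannot handle the load of concurrent requests from Scrapy? Does anything change if you lower the concurrency, or enable AutoThrottle?

Then you could open a bug on Splash or scrapy-splash with your Splash logs and Scrapy logs. Check the FAQ entry about getting gateway timeouts. Also, I would first increase the Splash log level (see).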
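The suggestion above to lower concurrency or enable AutoThrottle can be tried with a few lines in the project's settings.py. This is a sketch with illustrative values, not a tuned configuration; the setting names are standard Scrapy settings. The 504 retries in the log arrive in batches of eight, which matches Scrapy's default of 8 concurrent requests per domain, so a much smaller number is a reasonable first experiment.

```python
# settings.py (fragment); values are illustrative starting points.

# Scrapy defaults are 16 concurrent requests overall and 8 per domain.
CONCURRENT_REQUESTS = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# AutoThrottle adapts the delay between requests to the observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound for the delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests to aim for
```

If the timeouts disappear at low concurrency, the values can be raised step by step until the 504 responses return.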
Scrapy debug output:
crawl sofaspider -o out.csv
2017-07-20 13:03:40 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: sofa)
2017-07-20 13:03:40 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sofa.spiders', 'FEED_URI': 'out.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['sofa.spiders'], 'BOT_NAME': 'sofa', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36', 'FEED_FORMAT': 'csv'}
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-20 13:03:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-20 13:03:40 [scrapy.core.engine] INFO: Spider opened
2017-07-20 13:03:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-20 13:03:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-20 13:03:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.raymourflanigan.com/Sofas.aspx> (referer: None)
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/willoughby-sofa-200326456.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/union-square-sofa-200223105.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/castin-microfiber-sofa-200278403.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/toby-microfiber-leather-look-reclining-sofa-200217215.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/bryant-II-leather-power-reclining-sofa-217282538.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/crosby-sofa-with-chaise-200235097.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/anastasia-sofa-200209167.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/stylus-power-reclining-sofa-202239352.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:40 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/cordelia-sofa-200211201.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/ellington-leather-power-reclining-sofa-202291427.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/delano-power-reclining-sofa-200212520.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/quincey-power-reclining-sofa-200215627.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/corliss-sofa-200331104.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/skye-microfiber-power-reclining-sofa-200320074.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/mckinley-sofa-200211302.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.raymourflanigan.com/diana-sofa-200345115.aspx via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
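To see how the timeouts are distributed, the retry lines in a log like the one above can be grouped by timestamp. The following is a minimal sketch that assumes the log line format shown here; the function name count_504_batches and the sample lines (shortened to example.com URLs) are made up for illustration.

```python
import re
from collections import Counter

# Matches the RetryMiddleware lines for requests routed through Splash.
RETRY_504 = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[scrapy\.downloadermiddlewares\.retry\] DEBUG: Retrying "
    r"<GET (?P<url>\S+) via (?P<splash>\S+)> .*504 Gateway Time-out"
)


def count_504_batches(lines):
    """Return {timestamp: number of 504 retries logged at that second}."""
    batches = Counter()
    for line in lines:
        match = RETRY_504.match(line)
        if match:
            batches[match.group("ts")] += 1
    return dict(batches)


# Hypothetical sample in the same format as the log above.
sample = [
    "2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying "
    "<GET https://example.com/a via http://localhost:8050/render.html> "
    "(failed 1 times): 504 Gateway Time-out",
    "2017-07-20 13:04:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying "
    "<GET https://example.com/b via http://localhost:8050/render.html> "
    "(failed 1 times): 504 Gateway Time-out",
    "2017-07-20 13:04:40 [scrapy.extensions.logstats] INFO: Crawled 1 pages",
    "2017-07-20 13:04:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying "
    "<GET https://example.com/c via http://localhost:8050/render.html> "
    "(failed 1 times): 504 Gateway Time-out",
]
print(count_504_batches(sample))
```

Batches that all fail at the same second, roughly 30 seconds apart, suggest that every request in a batch hits the same rendering timeout rather than failing for page-specific reasons.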