Python Scrapy: maximum file size error
I'm having trouble downloading a large (~1.8 GB) file with Scrapy. My code is:
import scrapy

class CHSpider(scrapy.Spider):
    name = "ch_accountdata"
    allowed_domains = ['download.companieshouse.gov.uk']
    start_urls = ['http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html']
    custom_settings = {
        'DOWNLOAD_WARNSIZE': 0,
    }

    def parse(self, response):
        relative_url = response.xpath("//div[@class='grid_7 push_1 omega']/ul/li[12]/a/@href").extract()[0]
        download_url = response.urljoin(relative_url)
        yield {
            'file_urls': [download_url]
        }
This returns an error:
2017-08-01 17:10:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: develop)
2017-08-01 17:10:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'develop.spiders', 'SPIDER_MODULES': ['develop.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'develop'}
2017-08-01 17:10:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-01 17:10:34 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-01 17:10:34 [scrapy.core.engine] INFO: Spider opened
2017-08-01 17:10:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 17:10:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-08-01 17:10:35 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)
2017-08-01 17:10:35 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-08-01 17:10:35 [scrapy.core.downloader.handlers.http11] ERROR: Cancelling download of http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip: expected response size (1240658506) larger than download max size (1073741824).
2017-08-01 17:10:35 [scrapy.pipelines.files] WARNING: File (unknown-error): Error downloading file from: Cancelling download of http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip: expected response size (1240658506) larger than download max size (1073741824).
2017-08-01 17:10:35 [scrapy.core.scraper] DEBUG: Scraped from
{'files': [], 'file_urls': [u'http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip']}
2017-08-01 17:10:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 17:10:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
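{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.defer.CancelledError': 1,
 'downloader/request_bytes': 755,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 11061,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 1, 16, 10, 35, 806000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 1, 16, 10, 34, 559000)}
2017-08-01 17:10:35 [scrapy.core.engine] INFO: Spider closed (finished)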
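Comparing the first error log with the spider script you provided, I noticed a discrepancy.
Since you only posted the spider, I may not have the full picture, so you should also post your pipelines and the complete settings file. I'll work from the stack trace in the meantime, which should be enough to give you an adequate answer.
As for the discrepancy: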
yield {
    'file_urls': [download_url]
}
#First Error Log Line 36
{'files': [], 'file_urls': [u'http://download.companieshouse.gov.uk/Accounts_Monthly_Data-June2017.zip']}
You may not have read Scrapy's official documentation in depth. When downloading anything with Scrapy, keep the following in mind:
'DOWNLOAD_MAXSIZE' : 0,
'DOWNLOAD_TIMEOUT': 600
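To illustrate why disabling DOWNLOAD_WARNSIZE in the question did not help while disabling DOWNLOAD_MAXSIZE does, here is a simplified model of the size check, fed with the two numbers from the log. The helper `check_expected_size` is hypothetical and is not Scrapy's actual code. It assumes what the Scrapy settings documentation states: DOWNLOAD_MAXSIZE defaults to 1024 MB, DOWNLOAD_WARNSIZE defaults to 32 MB, and a value of 0 disables the respective limit.

```python
# Simplified model of the expected-size check (hypothetical helper,
# not the library's real implementation).

DEFAULT_MAXSIZE = 1024 * 1024 * 1024   # 1073741824 bytes, the limit shown in the log
DEFAULT_WARNSIZE = 32 * 1024 * 1024    # 33554432 bytes


def check_expected_size(expected, maxsize=DEFAULT_MAXSIZE, warnsize=DEFAULT_WARNSIZE):
    """Return 'cancel', 'warn' or 'ok' for a response of `expected` bytes.

    A limit of 0 is treated as "disabled".
    """
    if maxsize and expected > maxsize:
        return "cancel"
    if warnsize and expected > warnsize:
        return "warn"
    return "ok"


EXPECTED = 1240658506  # expected response size taken from the log

# Settings from the question: only the warning is disabled, the hard limit stays.
question_result = check_expected_size(EXPECTED, warnsize=0)

# Settings suggested here: the hard limit is disabled.
answer_result = check_expected_size(EXPECTED, maxsize=0)

print(question_result)  # cancel
print(answer_result)    # warn
```

Under this model the download is still cancelled with the question's settings, because the 1 GiB hard limit is untouched. With DOWNLOAD_MAXSIZE set to 0 only a size warning remains.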
python -c 'import sys;print("64bit" if sys.maxsize > 2**32 else "32bit")'
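The same 32-bit versus 64-bit check can also be run from inside a script, for example before starting the crawl. This is a minimal sketch: the function name `python_bitness` is made up for illustration, and it uses the pointer size from `struct.calcsize("P")` as an alternative to the `sys.maxsize` comparison in the one-liner.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


bits = python_bitness()

# Cross-check against the sys.maxsize test used in the one-liner.
same_answer = (bits == 64) == (sys.maxsize > 2**32)

print("{}bit".format(bits))
```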