
Python Scrapy: combining absolute and relative links - missing scheme


I am new to Scrapy and am struggling to join an absolute and a relative link. The error is: Missing scheme in request URL. The odd thing is that when I print the URL, it looks like the correct URL.

I have tried many different solutions from Stack Overflow but don't seem to be making any progress. Any help would be much appreciated.

My code:

import scrapy

class CHSpider(scrapy.Spider):
    name = "ch_companydata"
    allowed_domains = ['download.companieshouse.gov.uk']
    start_urls = ['http://download.companieshouse.gov.uk/en_output.html']

    custom_settings = {
        'DOWNLOAD_WARNSIZE': 0
    }

    def parse(self, response):
        relative_url = response.xpath("//div[@class='grid_7 push_1 omega']/ul[2]/li[1]/a/@href").extract()[0]
        download_url = response.urljoin(relative_url)
        print(download_url)
        yield { 
            'file_urls': download_url
        }
Error message:

2017-08-01 09:46:36 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: companieshouse)
2017-08-01 09:46:36 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'companieshouse.spiders', 'SPIDER_MODULES': ['companieshouse.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'companieshouse'}
2017-08-01 09:46:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-01 09:46:37 [scrapy.core.engine] INFO: Spider opened
2017-08-01 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 09:46:37 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-01 09:46:37 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)
2017-08-01 09:46:37 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
http://download.companieshouse.gov.uk/BasicCompanyData-2017-08-01-part1_5.zip
2017-08-01 09:46:37 [scrapy.core.scraper] ERROR: Error processing {'file_urls': u'http://download.companieshouse.gov.uk/BasicCompanyData-2017-08-01-part1_5.zip'}
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "c:\python27\lib\site-packages\scrapy\pipelines\files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "c:\python27\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "c:\python27\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-08-01 09:46:37 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 09:46:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 480,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8455,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 1, 8, 46, 37, 415000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 1, 8, 46, 37, 69000)}

2017-08-01 09:46:37 [scrapy.core.engine] INFO: Spider closed (finished)
The file_urls field needs to contain a list of URLs. The FilesPipeline iterates over that field (return [Request(x) for x in item.get(self.files_urls_field, [])] in the traceback), so when you give it a plain string it builds a request from each individual character, which is why the error complains about a missing scheme in the url "h". So you should yield the following instead:

    yield { 
        'file_urls': [download_url]
    }
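
For reference, a minimal sketch of the whole spider with that fix applied. It keeps the original selectors and settings; the ITEM_PIPELINES and FILES_STORE entries are assumptions added here so the FilesPipeline shown as enabled in the log has somewhere to save the file - your project settings may already provide them, so adjust as needed.

import scrapy


class CHSpider(scrapy.Spider):
    name = "ch_companydata"
    allowed_domains = ['download.companieshouse.gov.uk']
    start_urls = ['http://download.companieshouse.gov.uk/en_output.html']

    custom_settings = {
        # Suppress the "large response" warning; the zip download is big.
        'DOWNLOAD_WARNSIZE': 0,
        # Assumed here: enable the files pipeline and point it at a
        # download directory (hypothetical path) so file_urls get fetched.
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': 'downloads',
    }

    def parse(self, response):
        # Take the first relative href and make it absolute against the page URL.
        relative_url = response.xpath(
            "//div[@class='grid_7 push_1 omega']/ul[2]/li[1]/a/@href").extract_first()
        download_url = response.urljoin(relative_url)
        # file_urls must be a list, even when there is only one URL.
        yield {
            'file_urls': [download_url]
        }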
