Python ValueError: Missing scheme in request url: h

python, scrapy

I wrote a spider with Scrapy to fetch images from a website, but when I run the spider I get this error. Here is the code that extracts img_url:
img_url = div.find_all("img",class_="img-responsive img-thumbnail center-block")[0]['src']
When I paste img_url into a browser I can open the image, but when the spider tries to download it, it raises:
File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
spider.py
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
import scrapy
from scrapy.selector import Selector
from bs4 import BeautifulSoup
from deep_web2.items import DeepWeb2Item
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class DeepSpider(Spider):
    name = "deepSpider"
    staer_urls = ["http://hansamktkykr5yt4.onion/category/1/"]
    bash_url = "http://hansamktkykr5yt4.onion"
    headers = {
        "Host": "hansamktkykr5yt4.onion",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "Connection": "keep-alive"
    }

    def start_requests(self):
        yield scrapy.Request(url="http://hansamktkykr5yt4.onion/category/1/", headers=self.headers,
                             callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        html = sel.extract()
        html = html.encode('utf-8')
        soup = BeautifulSoup(html, "lxml")
        item_rows = soup.find_all("div", class_="row row-item")
        for div in item_rows:
            title = div.find_all("div", class_="item-details")[0].find_all("a")[0].get_text()
            url = div.find_all("div", class_="item-details")[0].find_all("a")[0]['href']
            address = div.find_all("small", class_="text-muted-666")[0].get_text()
            price = div.find_all("div", class_="col-xs-3 text-right listing-price")[0].find_all("strong")[0].get_text()
            img_url = div.find_all("img", class_="img-responsive img-thumbnail center-block")[0]['src']
            view_num = div.find_all("div", class_="text-muted text-center")[0].find_all("small")[0].get_text()
            link_ = self.bash_url + url
            yield scrapy.Request(url=link_, headers=self.headers,
                                 meta={"title": title, "address": address,
                                       "price": price, "img_url": img_url,
                                       "view_num": view_num},
                                 callback=self.parse_fetch)
        pageNum = soup.find_all("ul", class_="pagination")[0]
        now = pageNum.find_all("li", class_="active")[0].get_text()
        now = int(str(now).strip())
        print now
        for page_ in pageNum.find_all("li", class_=''):
            number_ = page_.get_text()
            try:
                temp = int(str(number_).strip())
            except:
                continue
            page_next = int(str(number_).strip())
            if page_next == now + 1:
                url = self.bash_url + page_.find_all("a")[0]['href']
                yield scrapy.Request(url=url, headers=self.headers, callback=self.parse_item)

    def parse_fetch(self, response):
        sel = Selector(response)
        html = sel.extract()
        html = html.encode('utf-8')
        soup = BeautifulSoup(html, "lxml")
        text = soup.find_all("p")[0].get_text()
        item = DeepWeb2Item()
        item['title'] = response.meta['title']
        item['address'] = response.meta['address']
        item['price'] = response.meta['price']
        item['img_url'] = response.meta['img_url']
        item['view_num'] = response.meta['view_num']
        item['content'] = text
        yield item
More of the error output:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\media.py", line 62, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "C:\Python27\lib\site-packages\scrapy\pipelines\images.py", line 147, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-03-15 08:42:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://hansamktkykr5yt4.onion/listing/63776/> (referer: http://hansamktkykr5yt4.onion/category/1/)
2017-03-15 08:42:23 [scrapy.core.scraper] ERROR: Error processing {'address': u'Ships from: Netherlands',
Your spider's start_urls must be a list, like this:

start_urls = ["https://www.google.com/"]

Otherwise the string is interpreted as a list of characters, and when the spider tries to take the first element, it gets the first letter "h".

Comment: My spider's start_urls is a list, but the error still comes. — @Wnj, try replacing staer_urls with start_urls.
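To see how a bare string ends up producing the request URL "h", here is a minimal sketch of what happens when code that expects a list of URLs iterates over a single string instead (the example.com URL is made up):

```python
# A plain string iterated like a list yields single characters, not URLs.
img_field = "http://example.com/a.png"  # one URL stored as a bare string

requests = [x for x in img_field]  # what a list comprehension over it does
print(requests[0])  # 'h' -- Scrapy then rejects "h" as missing a scheme
```

This is exactly the shape of the pipeline line in the traceback, `return [Request(x) for x in item.get(self.images_urls_field, [])]`.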
Comment: Does it stop immediately, or only after creating some requests? Can you provide the shell output? By the way: the Request in def start_requests() tells Scrapy to ignore our start_urls. Maybe it helps to print the content and type() of everything you pass to url=, e.g. link = self.bash_url + url before yield scrapy.Request(url=link, ...).

Comment: It doesn't stop immediately; in fact it can output the items' content.

Comment: What is staer_urls for? Shouldn't it be start_urls? Can you provide a more complete traceback? With only the last two lines it doesn't say which Request instantiation failed.

Comment: I have provided more of the traceback.

Comment: Then it is related to your img_url field, which you seem to reference in the ImagesPipeline settings. IMAGES_URLS_FIELD needs to point to a field in the item that contains a list of URLs, not a single URL. Try item['img_url'] = [response.meta['img_url']].
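Putting the fix together: the URL field must hold a list, and since an img src attribute can be relative, it is safest to resolve it against the base URL as well. A sketch under the assumption that img_url is the field IMAGES_URLS_FIELD points to (the image path is hypothetical):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the question

base_url = "http://hansamktkykr5yt4.onion"
img_src = "/images/63776.png"  # hypothetical src attribute from the page

# IMAGES_URLS_FIELD must name an item field that holds a *list* of
# absolute URLs, so resolve the src and wrap the result in a list:
img_urls = [urljoin(base_url, img_src)]
print(img_urls)  # ['http://hansamktkykr5yt4.onion/images/63776.png']
```

In the spider this corresponds to writing item['img_url'] = [response.meta['img_url']] in parse_fetch (with urljoin applied where img_url is first extracted, in case src is relative).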