XPath error - Scrapy spider error handling
So I'm building this crawler, and it crawls fine in the sense that I can log into the shell, browse the HTML page, and test my XPath queries. I don't know what I'm doing wrong; any help would be appreciated. I reinstalled Twisted, but nothing changed. My spider looks like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name = "spider1"
    #allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

# as posted, parse ends up outside the class body
def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search_results"]/div[1]/div')
    for site in sites:
        item = spiderItem()
        item['title'] = site.select('div[2]/h2/a/text()').extract
        item['author'] = site.select('div[2]/span/a/text()').extract
        item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
        items.append(item)
    return items
When I run the spider (scrapy crawl Spider1), I get the following error:
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-09-25 17:56:12-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-09-25 17:56:15-0400 [Spider1] DEBUG: Crawled (200) <GET http://www.example.com> (referer: None)
2012-09-25 17:56:15-0400 [Spider1] ERROR: Spider error processing <GET http://www.example.coms>
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
self._startRunCallbacks(result)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-09-25 17:56:15-0400 [Spider1] INFO: Closing spider (finished)
2012-09-25 17:56:15-0400 [Spider1] INFO: Dumping spider stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186965,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 9, 25, 21, 56, 15, 326000),
'scheduler/memory_enqueued': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2012, 9, 25, 21, 56, 12, 157000)}
2012-09-25 17:56:15-0400 [Spider1] INFO: Spider closed (finished)
2012-09-25 17:56:15-0400 [scrapy] INFO: Dumping global stats:
{}
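Since the page itself downloads fine (the 200 above) but zero items are scraped, it can also help to verify the XPath shape against a static copy of the page outside Scrapy. A minimal sketch using only the standard library; the HTML below is invented for illustration, so substitute a saved copy of the real page:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the real page: one #search_results container
# whose first inner div holds one div per result.
page = """<html><body>
<div id="search_results">
  <div>
    <div><div><h2><a>Book One</a></h2></div></div>
    <div><div><h2><a>Book Two</a></h2></div></div>
  </div>
</div>
</body></html>"""

root = ET.fromstring(page)
# Same shape as the spider's query: //*[@id="search_results"]/div[1]/div
sites = root.findall('.//*[@id="search_results"]/div[1]/div')
titles = [site.find('div/h2/a').text for site in sites]
print(titles)  # ['Book One', 'Book Two']
```

If the list comes back empty here too, the selector is wrong; if it works here but not in the spider, the problem is in the spider code, as the answers below conclude.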
Leo is right, the indentation is incorrect. Your script probably has some tabs and spaces mixed together, because you pasted some of the code and typed the rest, and your editor allows both tabs and spaces in the same file. Convert all tabs to spaces so it looks more like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name = "spider1"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')
        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract()   # call extract(); without the parens you store the method itself
            item['author'] = site.select('div[2]/span/a/text()').extract()
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)
        return items
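One way to confirm the mixed-indentation diagnosis is the standard library's tabnanny module, which reports lines whose indentation is ambiguous between tabs and spaces. The filename below is a made-up stand-in; point it at your actual spider module:

```shell
# Reproduce the problem: a file whose body mixes a tab (line 2) with spaces (line 3)
printf 'if True:\n\tx = 1\n        y = 2\n' > mixed_indent.py

# tabnanny prints a diagnostic for ambiguous indentation; no output means the file is clean
python3 -m tabnanny mixed_indent.py
```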
Your parse method is code outside the class; use the code below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name = "spider1"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')
        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract()   # call extract(); without the parens you store the method itself
            item['author'] = site.select('div[2]/span/a/text()').extract()
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)
        return items
For everyone facing this issue: make sure you haven't renamed the parse() method, as I had:
class CakeSpider(CrawlSpider):
    name = "cakes"
    allowed_domains = ["cakes.com"]
    start_urls = ["http://www.cakes.com/catalog"]

    def parse(self, response): # this should be 'parse' and nothing else
        # your code #
Otherwise it throws the same error:
...
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
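Both failure modes in the answers above (a parse that ends up outside the class body, and a renamed parse) break in the same way: Scrapy calls spider.parse, Python's attribute lookup finds only the base-class stub, and that stub raises NotImplementedError. A minimal sketch of the mechanism in plain Python; no Scrapy required, and all class names here are invented for illustration:

```python
class BaseSpiderStub:
    """Stand-in for Scrapy's BaseSpider: the stub at spider.py line 62."""
    def parse(self, response):
        raise NotImplementedError

class GoodSpider(BaseSpiderStub):
    def parse(self, response):       # correctly named, inside the class body
        return ["item"]

class RenamedSpider(BaseSpiderStub):
    def parse_page(self, response):  # renamed: never overrides the stub
        return ["item"]

# A mis-indented parse lands at module level, outside any class,
# so it is just a function and overrides nothing:
def parse(self, response):
    return ["item"]

print(GoodSpider().parse(None))      # ['item'] -- the override runs
try:
    RenamedSpider().parse(None)      # falls through to the base stub
except NotImplementedError:
    print("NotImplementedError, exactly as in the traceback")
```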
Comments:
- I spent about three hours trying to figure this out. -.-
- Which line in spider.py is line 62? Line 62 is: def parse(self, response): raise NotImplementedError
- The indentation is incorrect in the snippet you posted. You might want to double-check it in your script.
- Like Leo said: fix the indentation problems, post the code you actually ran, and get back to us.
- I only indented (spaced it over) so it would render as a code block. Hence the spaces.