
XPath error - Spider error processing

So I am building this spider, and it should crawl fine, since I can log into the shell, browse the HTML page, and test my XPath queries.

Not sure what I am doing wrong. Any help would be much appreciated. I reinstalled Twisted, but nothing changed.

My spider looks like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
name="spider1"
#allowed_domains = ["example.com"]
start_urls = [                  
              "http://www.example.com"
            ]

def parse(self, response):
 items = [] 
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search_results"]/div[1]/div')

    for site in sites:
        item = spiderItem()
        item['title'] = site.select('div[2]/h2/a/text()').extract
        item['author'] = site.select('div[2]/span/a/text()').extract
        item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()     
    items.append(item)
    return items
When I run the spider with scrapy crawl Spider1, I get the following error:

    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
    2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
    2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled 0 pages (at 0 pages/min), scraped  0 items (at 0 items/min)
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-09-25 17:56:15-0400 [Spider1] DEBUG: Crawled (200) <GET http://www.example.com> (referer: None)
    2012-09-25 17:56:15-0400 [Spider1] ERROR: Spider error processing <GET http://www.example.com>
    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
        raise NotImplementedError
    exceptions.NotImplementedError:

     2012-09-25 17:56:15-0400 [Spider1] INFO: Closing spider (finished)
     2012-09-25 17:56:15-0400 [Spider1] INFO: Dumping spider stats:
    {'downloader/request_bytes': 231,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 186965,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 9, 25, 21, 56, 15, 326000),
     'scheduler/memory_enqueued': 1,
     'spider_exceptions/NotImplementedError': 1,
     'start_time': datetime.datetime(2012, 9, 25, 21, 56, 12, 157000)}
      2012-09-25 17:56:15-0400 [Spider1] INFO: Spider closed (finished)
      2012-09-25 17:56:15-0400 [scrapy] INFO: Dumping global stats:
    {}

Leo is right, the indentation is incorrect. You probably have some tabs and spaces mixed in your script, because you pasted some of the code and typed the rest, and your editor allows both tabs and spaces in the same file. Convert all the tabs to spaces so that it looks more like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name = "spider1"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')

        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract()
            item['author'] = site.select('div[2]/span/a/text()').extract()
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)

        return items
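One quick way to confirm the diagnosis is to scan the file for lines whose leading whitespace mixes tabs and spaces. A minimal sketch (the helper name and the spider filename are just placeholders, not part of Scrapy):

```python
def find_mixed_indentation(lines):
    """Return the 1-based numbers of lines whose leading
    whitespace contains both tabs and spaces."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        # Slice off the indentation: everything before the first
        # non-whitespace character.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent and ' ' in indent:
            bad.append(lineno)
    return bad

# Usage: check the spider file before running it, e.g.
# with open('spider.py') as f:
#     print(find_mixed_indentation(f.readlines()))
```

On Python 2 you can also run the script with `python -tt`, which turns inconsistent tab usage into a hard error instead of letting the interpreter guess.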

Your parse method is defined outside the class; use the code below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name="spider1"
    allowed_domains = ["example.com"]
    start_urls = [
      "http://www.example.com"
     ]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')

        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract()
            item['author'] = site.select('div[2]/span/a/text()').extract()
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)
        return items
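Separate from the indentation problem, watch for a missing pair of parentheses: the question's original code assigns `site.select(...).extract` for the title and author without calling it, which stores the bound method itself rather than the extracted strings. A minimal sketch of the difference, using a stand-in class rather than Scrapy's real selector:

```python
class FakeSelector:
    """Stand-in for a Scrapy selector, just to illustrate the bug."""
    def extract(self):
        return ['Some Title']

sel = FakeSelector()

without_call = sel.extract    # a bound method object, not the data
with_call = sel.extract()     # the actual list of strings

print(callable(without_call))  # the method itself was stored
print(with_call)
```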

For everyone facing this issue, make sure you have not renamed the parse() method, like I did:

class CakeSpider(CrawlSpider):
    name            = "cakes"
    allowed_domains = ["cakes.com"]
    start_urls      = ["http://www.cakes.com/catalog"]

    def parse(self, response): #this should be 'parse' and nothing else

        #yourcode#
Otherwise it throws the same error:

...
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
    raise NotImplementedError
    exceptions.NotImplementedError:

I spent about three hours trying to figure that out -.-
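The traceback makes sense once you see how the base class is written: BaseSpider defines parse() only as a stub that raises NotImplementedError, so any subclass that fails to override it (wrong method name, or a method indented outside the class) falls through to that stub. A simplified sketch of the pattern (not Scrapy's actual source, and the class names are illustrative):

```python
class BaseSpider(object):
    """Simplified stand-in for scrapy.spider.BaseSpider."""
    def parse(self, response):
        # Subclasses are expected to override this.
        raise NotImplementedError

class BrokenSpider(BaseSpider):
    # Typo in the method name: the framework still calls parse(),
    # which resolves to the base-class stub above.
    def parse_items(self, response):
        return []

spider = BrokenSpider()
try:
    spider.parse(response=None)  # what the framework effectively does
except NotImplementedError:
    print("fell through to the base-class stub")
```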

Which line in spider.py is line 62?
Line 62 is: def parse(self, response): raise NotImplementedError
The indentation is incorrect in the snippet you posted. You might want to double-check it in your script.
Like Leo said, fix the indentation issue, post the code you actually run, and get back to us.
I just indented it (spaced it over) so it would render as a code block. Hence the spaces.