Raising NotImplementedError in Python

python python-3.x scrapy web-crawler

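When I try to run my code I get this error. I have already defined a live request, but it still doesn't work. Does anyone know how to handle this in Python? How important is the sitemap in this case? Thanks in advance.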

import logging
import re
from urllib.parse import urljoin, urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import Request
from scrapy.spiders import SitemapSpider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.shell import inspect_response
from sqlalchemy.orm import sessionmaker
from content.spiders.templates.sitemap_template import ModSitemapSpider
from content.models import db_connect, create_db_table, Articles
from content.items import ContentItems
from content.item_functions import (process_item,
                                process_singular_item,
                                process_date_item,
                                process_array_item,
                                process_plural_texts,
                                process_external_links,
                                process_article_text)

HEADER_XPATH = ['//h1[@class="article-title"]//text()']
AUTHOR_XPATH = ['//span[@class="cnnbyline"]//text()',
            '//span[@class="byline"]//text()']
PUBDATE_XPATH = ['//span[@class="cnnDateStamp"]//text()']
TAGS_XPATH = ['']
CATEGORY_XPATH = ['']
TEXT = ['//div[@id="storytext"]//text()',
    '//div[@id="storycontent"]//p//text()']
INTERLINKS = ['//span[@class="inStoryHeading"]//a/@href']
DATE_FORMAT_STRING = '%Y-%m-%d'


class CNNnewsSpider(ModSitemapSpider):

    name = 'cnn'
    allowed_domains = ["cnn.com"]
    sitemap_urls = ["http://edition.cnn.com/sitemaps/sitemap-news.xml"]


def parse(self, response):
    items = []
    item = ContentItems()
    item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
    item['resource'] = urlparse(response.url).hostname
    item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
    item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
    item['tags'] = process_plural_texts(self, response, TAGS_XPATH, single=False)
    item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
    item['article_text'] = process_article_text(self, response, TEXT)
    item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
    item['link'] = response.url
    items.append(item)
    return items

Here is my output as text:

  File "/home/nik/project/lib/python3.5/site-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2016-10-17 18:48:04 [scrapy] DEBUG: Redirecting (302) to <GET http://edition.cnn.com/2016/10/15/opinions/the-black-panthers-heirs-after-50-years-joseph/index.html> from <GET http://www.cnn.com/2016/10/15/opinions/the-black-panthers-heirs-after-50-years-joseph/index.html>
2016-10-17 18:48:04 [scrapy] DEBUG: Redirecting (302) to <GET http://edition.cnn.com/2016/10/15/africa/montreal-climate-change-hfc-kigali/index.html> from <GET http://www.cnn.com/2016/10/15/africa/montreal-climate-change-hfc-kigali/index.html>
2016-10-17 18:48:04 [scrapy] DEBUG: Redirecting (302) to <GET http://edition.cnn.com/2016/10/14/middleeast/battle-for-mosul-hawija-iraq/index.html> from <GET http://www.cnn.com/2016/10/14/middleeast/battle-for-mosul-hawija-iraq/index.html>
2016-10-17 18:48:04 [scrapy] ERROR: Spider error processing <GET http://edition.cnn.com/2016/10/15/politics/donald-trump-hillary-clinton-drug-test/index.html> (referer: http://edition.cnn.com/sitemaps/sitemap-news.xml)
Traceback (most recent call last):
  File "/home/nik/project/lib/python3.5/site-packages/twisted/internet/defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/nik/project/lib/python3.5/site-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError

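The exception is raised because your class CNNnewsSpider does not override the parse() method from scrapy.BaseSpider. Although you define a parse() method in the code you pasted, its indentation means it is not part of CNNnewsSpider: it is defined as a standalone function instead. You need to fix the indentation as follows: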

class CNNnewsSpider(ModSitemapSpider):
    name = 'cnn'
    allowed_domains = ["cnn.com"]
    sitemap_urls = ["http://edition.cnn.com/sitemaps/sitemap-news.xml"]

    def parse(self, response):
        items = []
        item = ContentItems()
        item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
        item['resource'] = urlparse(response.url).hostname
        item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
        item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
        item['tags'] = process_plural_texts(self, response, TAGS_XPATH, single=False)
        item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
        item['article_text'] = process_article_text(self, response, TEXT)
        item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
        item['link'] = response.url
        items.append(item)
        return items
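
The same failure mode can be reproduced without Scrapy, which may make the cause easier to see. The sketch below is only an illustration: BaseSpider, BrokenSpider, FixedSpider and overrides_parse are hypothetical names invented for this example, and BaseSpider is a plain stand-in class, not Scrapy's own base spider.

```python
class BaseSpider:
    """Hypothetical stand-in for a framework base class (not Scrapy itself)."""

    def parse(self, response):
        # Subclasses are expected to override this method.
        raise NotImplementedError


class BrokenSpider(BaseSpider):
    name = "broken"


# Dedented to column 0, so this is a module-level function, not a method
# of BrokenSpider. BrokenSpider still inherits BaseSpider.parse.
def parse(self, response):
    return [response]


class FixedSpider(BaseSpider):
    name = "fixed"

    # Indented inside the class body, so it overrides BaseSpider.parse.
    def parse(self, response):
        return [response]


def overrides_parse(cls):
    """Return True if cls (or an intermediate base) replaces BaseSpider.parse."""
    return cls.parse is not BaseSpider.parse


if __name__ == "__main__":
    print(overrides_parse(BrokenSpider))
    print(overrides_parse(FixedSpider))
    try:
        BrokenSpider().parse("page")
    except NotImplementedError:
        print("BrokenSpider.parse raised NotImplementedError")
    print(FixedSpider().parse("page"))
```

A check along the lines of overrides_parse can be a quick way to confirm whether a spider class really picked up the method you meant to give it, before running a full crawl.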

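In general, you should post relevant text as text, not as a link to a screenshot.

@khelwood, thanks for the advice.

Do you really need all of those imports to reproduce this problem? Please read [link].

Have you gone to the scrapy support forums to ask? Stack Overflow isn't designed for product support.

It doesn't matter whether the imports come from the project. The point is that you should provide the minimum code needed to reproduce the problem. Many people won't bother running code that contains lots of imports they may not have installed. You need to take the time to remove everything that can be removed while still reproducing the problem. This matters as much to you as to the people answering your question; it helps you narrow down the code and find the answer.

Thanks a lot for your help @neftes. I did that, but I still get "NotImplementedError" :(( (ERROR: Spider error processing) scrapy crawl "name of class"

What did you actually use for "name of class"?

I didn't change anything. This is my code, nothing more.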