Python 3.x Scrapy: crawling from local files: Content-Type is undefined


I want Scrapy to crawl local HTML files, but I am stuck because the header is missing the Content-Type field. I followed the tutorial here; so basically, I point Scrapy at a local URL such as
file:///Users/felix/myfile.html

However, Scrapy crashes because it looks like (on macOS at least) the resulting response object does not contain the required field
Content-Type
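
For reference, here is a minimal sketch of such a spider (the class name is an illustrative assumption; the actual crawler is news-please's download_crawler.py):

import scrapy

class LocalFileSpider(scrapy.Spider):
    # Hypothetical spider for illustration only; the real crawler lives in
    # newsplease/crawler/spiders/download_crawler.py.
    name = 'local_file'
    start_urls = ['file:///Users/felix/myfile.html']

    def parse(self, response):
        # For file:// URLs Scrapy builds the response with empty headers,
        # which is what triggers the crash shown below.
        yield {'title': response.css('title::text').extract_first()}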

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/felix/IdeaProjects/news-please/newsplease/__init__.py
[scrapy.core.scraper:158|ERROR] Spider error processing <GET file:///Users/felix/IdeaProjects/news-please/newsplease/0a2199bdcef84d2bb2f920cf042c5919> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/felix/IdeaProjects/news-please/newsplease/crawler/spiders/download_crawler.py", line 33, in parse
    if not self.helper.parse_crawler.content_type(response):
  File "/Users/felix/IdeaProjects/news-please/newsplease/helper_classes/parse_crawler.py", line 116, in content_type
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
AttributeError: 'NoneType' object has no attribute 'decode'
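
The AttributeError at the bottom is only the symptom: for a file:// URL Scrapy builds the response without a Content-Type header at all, so response.headers.get('Content-Type') returns None. A standalone sketch that reproduces this (not code from news-please):

from scrapy.http import TextResponse

# A response constructed without headers, as Scrapy does for file:// URLs.
resp = TextResponse(url='file:///Users/felix/myfile.html',
                    body=b'<html></html>', encoding='utf-8')

print(resp.headers.get('Content-Type'))  # None -> calling .decode() on it raises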
Someone suggested running a simple HTTP server (see the sketch below), but that is not an option, mainly because running another server creates overhead.
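
For completeness, the suggested workaround would look roughly like this (a minimal sketch using only the standard library; the port and directory are assumptions):

# Equivalent to running `python3 -m http.server 8000` from /Users/felix.
# SimpleHTTPRequestHandler sets Content-Type from the file extension, so
# the spider could crawl http://localhost:8000/myfile.html instead.
import http.server
import os
import socketserver

os.chdir('/Users/felix')  # serve the directory holding the HTML files

with socketserver.TCPServer(('', 8000), http.server.SimpleHTTPRequestHandler) as httpd:
    httpd.serve_forever()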


I need to use Scrapy in the first place because we have a larger framework that uses Scrapy, and we plan to add the ability to crawl from local files to that framework. However, since there are several questions on how to crawl from local files (see the links above), I think this issue is of general interest.

You can actually change the function def content_type(self, response) in newsplease/helper_classes/parse_crawler.py to always return True if the response comes from local storage.

The new file will look like this:

def content_type(self, response):
    """
    Ensures the response is of type text/html.

    :param obj response: The scrapy response
    :return bool: Determines whether the response is of the correct type
    """
    # Local files are served without a Content-Type header, so accept
    # them unconditionally.
    if response.url.startswith('file:///'):
        return True
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
        self.log.warn("Dropped: %s's content is not of type "
                      "text/html but %s", response.url,
                      response.headers.get('Content-Type'))
        return False
    else:
        return True
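
Note the order of the checks: simply defaulting the header lookup, e.g. response.headers.get('Content-Type', b''), would avoid the AttributeError but would still drop every local file, because an empty value never matches text/html. Hence the explicit file:/// test runs before the header is inspected.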

Yes, that is an option, but I don't want to change external libraries, since I need to package them with our framework.

Then you could report this as a bug.