Python Scrapy CrawlSpider only crawls the start URLs

python scrapy

I found that my CrawlSpider only crawls the start_urls and does not crawl any further.

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['holy-bible-eng']
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response

Below is my start_urls file:

file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml


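From the Scrapy documentation:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.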
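So a rule that sets a callback without an explicit follow only runs the callback and does not go any further. Your rule passes follow=True explicitly together with the callback, which is allowed, so it should keep following links.
The main idea behind CrawlSpider rules is that they let you separate the links to follow from the links you actually want to extract data from.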
Now, crawling "local" files with Scrapy is not the best approach, since a simple script would be enough for that.
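
As a rough illustration of what such a simple script might look like, here is a sketch that uses only the standard library. The names LinkCollector and walk_local_pages are hypothetical, and the sketch assumes the pages are UTF-8 files that link to each other with relative href values.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def walk_local_pages(start_file):
    """Visit start_file and every local page reachable from it; return the paths."""
    start = Path(start_file).resolve()
    pending = [start]
    seen = {start}
    visited = []
    while pending:
        page = pending.pop()
        visited.append(page)
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for href in collector.hrefs:
            target, _fragment = urldefrag(href)
            if not target or "://" in target:
                continue  # skip same-page anchors and non-local URLs
            candidate = (page.parent / unquote(target)).resolve()
            if candidate.is_file() and candidate not in seen:
                seen.add(candidate)
                pending.append(candidate)
    return visited
```

Calling walk_local_pages with the path of the table-of-contents file would return that file first, followed by the other local pages it links to, directly or indirectly.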
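Another error is that you are setting the allowed_domains class variable, which specifies which domains the spider should accept. Links to all other domains are rejected, and this only makes sense for links on the internet. If you don't want to reject any domains, or aren't using domains at all (your case), remove that variable.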
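
To see why a file:// link can never pass a domain allow-list, consider the sketch below. It is a simplified stand-in written for this illustration, not Scrapy's own filtering code, and the function name is_allowed is hypothetical. The point it shows is that a file:// URL has no hostname, so there is nothing to match against an entry such as holy-bible-eng.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Simplified domain check: True if the URL's host is an allowed domain or a subdomain of one."""
    if not allowed_domains:
        return True  # no allow-list configured, so nothing is rejected
    host = urlparse(url).hostname or ""  # file:// URLs have no hostname
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None


local_url = "file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml"
print(urlparse(local_url).hostname)               # the hostname part is empty
print(is_allowed(local_url, ["holy-bible-eng"]))  # rejected: no host to match
print(is_allowed(local_url, []))                  # accepted once the list is removed
```

Under this simplified model, removing the allow-list is the only way for local file links to be accepted, which is consistent with the advice above.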
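Thanks for your reply. I just commented out allowed_domains, and it started following links!
Glad it helped! If the answer helped you, please remember to accept it.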