Python Scrapy CrawlSpider doesn't crawl the first landing page

I am new to Scrapy and I am working through a scraping exercise using CrawlSpider. Although the Scrapy framework works fine and follows the relevant links, I can't seem to get the CrawlSpider to scrape the very first link (the home/landing page). Instead it goes straight to scraping the links determined by the rule, but never scrapes the landing page the links sit on. I don't know how to fix this, since overriding the spider's parse method is not recommended, and changing follow=True/False doesn't produce any good result either. Here is the code snippet:

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de"        
        ]
    rules = (   
        Rule(SgmlLinkExtractor(aloow='prod'), callback='parse_item', follow=True),
        )
    fname = 1

    def parse_item(self, response):
        open(str(self.fname)+ '.txt', 'a').write(response.url)
        open(str(self.fname)+ '.txt', 'a').write(','+ str(response.meta['depth']))
        open(str(self.fname)+ '.txt', 'a').write('\n')
        open(str(self.fname)+ '.txt', 'a').write(response.body)
        open(str(self.fname)+ '.txt', 'a').write('\n')
        self.fname = self.fname + 1

There are a number of ways to do this, but one of the simplest is to implement parse_start_url and then modify start_urls:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = ["http://www.bnt-chemicals.de/tunnel/index.htm"]
    rules = (
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
        )
    fname = 1

    # route the start_urls response through parse_item so the landing page is scraped too
    def parse_start_url(self, response):
        return self.parse_item(response)


    def parse_item(self, response):
        open(str(self.fname)+ '.txt', 'a').write(response.url)
        open(str(self.fname)+ '.txt', 'a').write(','+ str(response.meta['depth']))
        open(str(self.fname)+ '.txt', 'a').write('\n')
        open(str(self.fname)+ '.txt', 'a').write(response.body)
        open(str(self.fname)+ '.txt', 'a').write('\n')
        self.fname = self.fname + 1

Or simply change the callback to parse_start_url and override it:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de",
    ]
    rules = (
        # pointing the callback at parse_start_url means the extracted links are
        # handled by the same method CrawlSpider already calls for the start_urls response
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
    )
    fname = 0

    def parse_start_url(self, response):
        self.fname += 1
        fname = '%s.txt' % self.fname

        with open(fname, 'w') as f:
            f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
            f.write('%s\n' % response.body)

Thanks, this fixed the problem. Would it still work without the callback to parse_start_url? If so, when does parse_start_url get called?
@JasonYouk parse_start_url is an abstract/dummy method of the CrawlSpider; it is overridden here.
Yup, this fixed it. Thanks.
You misspelled the allow argument.
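
For completeness, here is a sketch of the same fix against a more recent Scrapy API; SgmlLinkExtractor and the scrapy.contrib package used in the answers above have since been removed, so this is a modernized adaptation rather than part of the original answers. It keeps the spider from the question unchanged apart from using scrapy.spiders.CrawlSpider, scrapy.linkextractors.LinkExtractor, the correctly spelled allow argument, and bytes-safe file writes (response.body is bytes on Python 3):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = ['http://www.bnt-chemicals.de']
    rules = (
        # note the argument is "allow", not "aloow"
        Rule(LinkExtractor(allow='prod'), callback='parse_item', follow=True),
    )
    fname = 0

    def parse_start_url(self, response):
        # CrawlSpider's default parse_start_url() yields nothing, which is why
        # the landing page is skipped unless this hook is overridden
        return self.parse_item(response)

    def parse_item(self, response):
        self.fname += 1
        with open('%d.txt' % self.fname, 'wb') as f:
            f.write(('%s, %s\n' % (response.url, response.meta.get('depth', 0))).encode())
            f.write(response.body)
            f.write(b'\n')

It can be run the same way as the originals, for example with scrapy crawl downloader inside a project, or standalone with scrapy runspider pointed at the spider file.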