
Python scraper ends after parsing one link


I've been writing this web scraper and I can't figure out why it just ends. Here's the code:

import scrapy, MySQLdb, urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy import Request


class MyItems(scrapy.Item):
    topLinks = scrapy.Field()
    artists = scrapy.Field()

class mp3Spider(CrawlSpider):
    name = 'mp3_scraper'
    allowed_domains = [
        'example.com'
    ]
    start_urls = [
        'http://www.example.com'
    ]

    def __init__(self, *a, **kw):
        super(mp3Spider, self).__init__(*a, **kw)

        self.item = MyItems()

    def parse(self, response):
        f = open('topLinks', 'w')
        self.item['topLinks'] = response.xpath("//div[contains(@class, 'en')]/a[contains(@class, 'hash')]/@href").extract()

        for x in range(len(self.item['topLinks'])):
            self.item['topLinks'][x] = 'http://www.example.com' + self.item['topLinks'][x]

        for x in range(len(self.item['topLinks'])):
            f.write(format(self.item['topLinks'][x]).encode('utf-8')+ '\n')
            yield Request(url=self.item['topLinks'][x], callback=self.parse_artists)

    def parse_artists(self, response):
        f = open('artists', 'w')
        self.item['artists'] = response.xpath("//ul[contains(@class, 'artist_list')]/li/a/text()").extract()

        for x in range(len(self.item['artists'])):
            f.write(format(self.item['artists'][x]).encode('utf-8') + '\n')
So both parse functions get the information I need, but parse_artists only parses one link. The parse function grabs all the links I need, and I can see that it does because I print them to a file. So say it grabs the links example.com/artists/a, example.com/artists/b, and so on. parse_artists will only scrape example.com/artists/a and then stop. Any help would be appreciated, thanks - Sam

Edit: output log -

C:\Python27\python.exe C:/Users/sam/PycharmProjects/mp3_scraper/mp3_scraper/mp3_scraper/main.py
2014-09-13 12:28:24-0400 [scrapy] INFO: Scrapy 0.24.2 started (bot: mp3_scraper)
2014-09-13 12:28:24-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-09-13 12:28:24-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mp3_scraper.spiders', 'SPIDER_MODULES': ['mp3_scraper.spiders'], 'BOT_NAME': 'mp3_scraper'}
2014-09-13 12:28:24-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled item pipelines: 
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Spider opened
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/> (referer: None)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/z/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/0..9/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/w/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/x/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/u/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/q/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/v/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/y/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/t/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/o/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/p/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/r/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/n/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/s/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/l/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/h/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/k/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/i/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/g/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/m/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/j/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/f/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/e/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/c/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/d/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/b/> (referer: http://www.myfreemp3.cc/artists/)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Closing spider (finished)
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 10106,
     'downloader/request_count': 27,
     'downloader/request_method_count/GET': 27,
     'downloader/response_bytes': 887850,
     'downloader/response_count': 27,
     'downloader/response_status_count/200': 27,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 9, 13, 16, 28, 28, 908000),
     'log_count/DEBUG': 29,
     'log_count/INFO': 7,
     'request_depth_max': 1,
     'response_received_count': 27,
     'scheduler/dequeued': 27,
     'scheduler/dequeued/memory': 27,
     'scheduler/enqueued': 27,
     'scheduler/enqueued/memory': 27,
     'start_time': datetime.datetime(2014, 9, 13, 16, 28, 25, 315000)}
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Spider closed (finished)

Process finished with exit code 0
Opening the artists file in 'w' mode truncates it if it already exists. So after the spider finishes, only the last scraped batch is left in the file.
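A minimal standalone demonstration of the difference, mimicking what each callback does to the file. The paths and words here are placeholders, not taken from the spider:

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
w_path = os.path.join(tmpdir, 'w_demo.txt')
a_path = os.path.join(tmpdir, 'a_demo.txt')

# 'w' truncates on every open(): only the last write survives.
for word in (u'first', u'second', u'third'):
    with io.open(w_path, 'w', encoding='utf-8') as f:
        f.write(word + u'\n')

# 'a' appends: every write is kept.
for word in (u'first', u'second', u'third'):
    with io.open(a_path, 'a', encoding='utf-8') as f:
        f.write(word + u'\n')

with io.open(w_path, encoding='utf-8') as f:
    print(f.read().splitlines())   # ['third']
with io.open(a_path, encoding='utf-8') as f:
    print(f.read().splitlines())   # ['first', 'second', 'third']
```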

You should open the file in append mode 'a' to fix this:

def parse_artists(self, response):
    f = open('artists', 'a')
    ...
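A sketch of the callback's file-writing logic with that fix applied, pulled out into a helper so it can be tested in isolation. write_artists is a hypothetical name, and io.open with an explicit encoding is used so the UTF-8 handling behaves the same on Python 2.7 (which the log shows) and Python 3:

```python
import io

def write_artists(path, artists):
    # Append mode ('a') preserves what earlier callbacks already wrote;
    # the with-block also guarantees the handle gets closed, which the
    # original code never does explicitly.
    with io.open(path, 'a', encoding='utf-8') as f:
        for name in artists:
            f.write(name + u'\n')
```

In a real project, a Scrapy item pipeline would be the more idiomatic place for this kind of output, since it keeps file handling out of the spider entirely.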

Could you add the log output Scrapy produces when you run the spider to the question?