
Python: 502 errors when running a Scrapy scraping script


Scraping newbie here. I'm using Scrapy to pull a set of data from a single site. When I run the script it works fine for a few minutes, but then it slows down, almost stops, and keeps throwing the following two errors for the various URLs it is trying to scrape:

2013-07-20 14:15:17-0700 [billboard_spider] DEBUG: Retrying <GET http://www.billboard.com/charts/1981-01-17/hot-100> (failed 1 times): Getting http://www.billboard.com/charts/1981-01-17/hot-100 took longer than 180 seconds.

2013-07-20 14:16:56-0700 [billboard_spider] DEBUG: Crawled (502) <GET http://www.billboard.com/charts/1981-01-17/hot-100> (referer: None) 

The spider works for me and scrapes the data without any problems. So, as @Tiago suspected, you've been banned.

Read up on getting banned, and adjust your scraping settings accordingly. I'd start by increasing the download delay, then try rotating IP addresses.
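For example, throttling can be configured in the project's settings.py. These are standard Scrapy settings, but the numbers below are illustrative, not tuned for billboard.com, and AutoThrottle requires a reasonably recent Scrapy version:

```python
# settings.py -- illustrative politeness settings, adjust for the target site

DOWNLOAD_DELAY = 5                   # wait 5 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay (0.5x-1.5x) to look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain

AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down

RETRY_TIMES = 2                      # don't hammer URLs that keep failing
```

Settings can also be overridden for a single run on the command line, e.g. `scrapy crawl billboard_spider -s DOWNLOAD_DELAY=5`.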

You could also consider switching to driving a real automated browser, such as Selenium.

Also, see whether you can get the dates from the RSS/XML feed instead.


Hope this helps.

So I'm banned? — You're welcome. If you decide to switch to Selenium, let me know if you need help.
import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        date = datetime.date(year=1975, month=12, day=27)

        # Build one chart URL per week from the start date through the end of 2012
        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        item = BillBoardItem()
        item['date'] = date
        for song in songs:
            try:
                track = song.select('.//header/h1/text()').extract()[0]
                item['song'] = track.rstrip()
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
                # Keep only the first entry that parses cleanly (the chart's top song)
                break
            except IndexError:
                continue

        yield item
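Assuming a standard Scrapy project layout, the spider above can be run and its items exported to JSON from the project directory (the output filename is arbitrary; the `-t json` exporter flag is the syntax used by Scrapy versions of that era):

```shell
scrapy crawl billboard_spider -o charts.json -t json
```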