Python: Another IMDB crawler using Scrapy

python scrapy

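I ran into some problems while trying to crawl IMDB and could not find an answer here.

I'm trying to scrape some data from IMDB search result pages using the following code: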

import scrapy
from tutorial.items import MovieItem, CastItem


class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(1950, 1951):
            for page in range(1, 3):
                yield scrapy.Request('http://www.imdb.com/search/title?release_date=%s&page=%s' % (year, page))

    def parse(self, response):
        # stop once the ranking goes past this number
        self.wanted_num = 50
        for sel in response.xpath("//*[contains(@class,'lister-item-content')]"):
            item = MovieItem()
            item['Title'] = sel.xpath('h3/a/text()').extract()[0]
            item['Rating'] = sel.xpath('div[@class="ratings-bar"]/div[@name="ir"]/strong/text()').extract()[0]
            item['Ranking'] = sel.xpath('h3/span[@class="lister-item-index unbold text-primary"]/text()').extract()[0]
            item['ReleaseDate'] = sel.xpath('h3/span[@class="lister-item-year text-muted unbold"]/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('h3/a/@href').extract()[0]
            # follow the movie's detail page and pass the partial item along
            # (parseMovieDetails is not shown in the question)
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            if int(item['Ranking']) >= self.wanted_num + 1:
                return
            yield request

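So, the problems here are:

It seems to go into an infinite loop while trying to crawl these pages (301 redirects), and I don't know why :(

I suspect the ranking needs to be trimmed, because on the page it appears as something like "1." rather than a bare number. How can I trim that off the end of the string?

Thanks for your help!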

Can you post the crawl log? You can capture it with either

scrapy crawl spider -s LOG_FILE=output.log

or

scrapy crawl spider &> output.log

Hello, something strange happened :-) It works now! As for question #2, I fixed it like this:

item['Ranking'] = re.match(r'(^[0-9]+)', sel.xpath('h3/span[@class="lister-item-index unbold text-primary"]/text()').extract()[0].__str__().strip()).group(1)
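For readability, the one-liner above can be unpacked into a small helper. The following is only a minimal sketch using the standard library; the function name `parse_ranking` is hypothetical, and it assumes the ranking text looks like "1." with optional surrounding whitespace.

```python
import re


def parse_ranking(raw):
    """Return the leading integer from ranking text such as ' 1.' (assumed format)."""
    # Skip leading whitespace, then capture the first run of digits.
    match = re.match(r'\s*(\d+)', raw)
    if match is None:
        # Fail loudly instead of hitting AttributeError on .group()
        raise ValueError('no leading digits in %r' % raw)
    return int(match.group(1))


print(parse_ranking(' 1.\n'))
```

Returning an `int` here also means the later comparison against `self.wanted_num` would no longer need its own `int()` call.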

FYI: Scrapy selectors already have a built-in regex shortcut, for example:

sel.xpath('//div').re('someregex')

Thanks! I didn't know that.