Python IMDB movie scraping using Scrapy gives a blank CSV


I am getting a blank CSV, even though no code errors are shown. The spider is not crawling the web page.

This is the code I wrote following a YouTube tutorial:

import scrapy

from Example.items import MovieItem

class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_url = ('http://www.imdb.com/chart/top',)

    def parse(self, response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr[' + str(i) + ']/td[3]/strong/text()'
            rating = response.xpath(url_next).extact()
            if (i <= len(link)):
                i = i + 1
                yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'rating': rating})

    def parse_indetail(self, response):
        item = MovieItem()
        item['title'] = response.xpath('//div[@class="title_wrapper"])/h1/text()').extract[0][:-1]
        item['directors'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="director"]/a/span/text()').extract()[0]
        item['writers'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="creator"]/a/span/text()').extract()
        item['stars'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="actors"]/a/span/text()').extract()
        item['popularity'] = response.xpath('//div[@class="titleReviewBarSubItem"]/div/span/text()').extract()[2][21:-8]

        return item
I have tested the xpaths you provided and I can't tell whether they are mistyped or actually just wrong,

e.g. the title xpath '//div[@class="title_wrapper"])/h1/text()' contains a stray ")", and .extact() is a misspelling of .extract().

Plus, your XPaths did not yield any results.

As for why you are getting the error saying 0 pages crawled, short of recreating your case I have to assume that your page-iteration method is not building the page URLs correctly.
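
A likely culprit for the "0 pages crawled" result, judging only from the snippet as posted: Scrapy's default start_requests() reads a list attribute named start_urls (plural), so a spider that defines start_url instead never schedules a single request, and the exported CSV stays empty. A minimal corrected header would be:

    class ThirdSpider(scrapy.Spider):
        name = "imdbtestspider"
        allowed_domains = ["imdb.com"]
        # must be `start_urls` (a list); an attribute named `start_url` is ignored
        start_urls = ['http://www.imdb.com/chart/top']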

I had a hard time following the idea of creating a variable array of all the "follow links" and then using len to send them through to parse_indetail(), but there are a couple of things to note.

  • When you use meta to pass an item from one function to the next, although you have the right idea, you are missing the instantiation in the function you are passing it to (for simplicity, you should also stick to the standard naming convention).
  • It should look something like this:

    def parse(self, response):
        # If you are going to capture an item at the first request, you must
        # instantiate your items class
        item = MovieItem()
        ....
        # You seem to want to pass the rating to the next function for
        # itemization, so make sure it is listed in your items.py file, then set it
        item['rating'] = response.xpath(PATH).extract()  # why did you add the url_next here?
        ....
        # The standard convention for passing meta in a callback is like this;
        # this way the whole itemized item gets passed along
        yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'item': item})

    def parse_indetail(self, response):
        # Then you must initialize the meta again in the function you're passing it to
        item = response.meta['item']
        # Then you can continue your scraping

  • You should not over-complicate the page-iteration logic. You seem to understand how it works, but this aspect needed fine-tuning. I have recreated your use case and optimized it.
  • Notice two things: First, the parse() function. All it does here is loop through the links with a for loop, where each instance in the loop refers to an href, and pass the url-joined href to the parser function. Given your use case, that is more than enough. In a situation where there is a next page, it would just create a variable for the "next" page somehow and callback to parse, and it would keep doing that until it couldn't find a "next" page (a minimal sketch of that pattern follows after these notes).


    Second, only use xpath when the HTML has identical tags with different content. This is more of a personal opinion, but I tell people that xpath selectors are like a scalpel and css selectors are like a butcher knife. You can get extremely precise with a scalpel, but it takes more time, and in many cases it may just be easier to use a css selector to get the same result.
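
    Since the Top 250 chart is a single page you do not need pagination here, but for completeness, here is a minimal, self-contained sketch of that "next page" pattern; the spider name and the a.next-page selector are illustrative, not IMDb's actual markup:

        import scrapy

        class PaginatedSpider(scrapy.Spider):
            # hypothetical spider showing the "next page" callback pattern
            name = 'paginated'
            start_urls = ['http://www.imdb.com/chart/top']

            def parse(self, response):
                for link in response.css(".titleColumn a::attr(href)").extract():
                    yield scrapy.Request(response.urljoin(link), callback=self.get_info)
                # Follow the "next" link, if any, and let parse() run again;
                # this stops on its own once no such link exists.
                next_page = response.css("a.next-page::attr(href)").extract_first()
                if next_page:
                    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

            def get_info(self, response):
                # placeholder for the real per-title scraping logic
                yield {'url': response.url}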

    Here is another approach you can try. I used css selectors instead of xpath to make the script less verbose:

    import scrapy
    
    class ImbdsdpyderSpider(scrapy.Spider):
        name = 'imbdspider'
        start_urls = ['http://www.imdb.com/chart/top']
    
        def parse(self, response):
            for link in response.css(".titleColumn a[href^='/title/']::attr(href)").extract():
                yield scrapy.Request(response.urljoin(link), callback=self.get_info)
    
        def get_info(self, response):
            item = {}
            title = response.css(".title_wrapper h1::text").extract_first()
            item['title'] = ' '.join(title.split()) if title else None
            item['directors'] = response.css(".credit_summary_item h4:contains('Director') ~ a::text").extract()
            item['writers'] = response.css(".credit_summary_item h4:contains('Writer') ~ a::text").extract()
            item['stars'] = response.css(".credit_summary_item h4:contains('Stars') ~ a::text").extract()
            popularity = response.css(".titleReviewBarSubItem:contains('Popularity') .subText::text").extract_first()
            item['popularity'] = ' '.join(popularity.split()).strip("(") if popularity else None
            item['rating'] = response.css(".ratingValue span::text").extract_first()
            yield item
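
    If the CSV is still blank with a spider like this, also check how you are exporting it; the file comes from Scrapy's feed exporter, e.g. (the output file name here is just an example):

        scrapy crawl imbdspider -o movies.csv

    An empty file with no errors normally just means zero items were yielded; in that case the final stats dump will not even contain an item_scraped_count entry.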
    

    Here is working code for what you are doing. If you need more specific help, please provide all of your code (I take it that Example.items in your example is your own custom code?). Also, start_urls is a list, so it needs to go inside square brackets, i.e. start_urls = ['www.abc.com',]:
    #items.py file
    import scrapy
    
    
    class TestimbdItem(scrapy.Item):
        title = scrapy.Field()
        directors = scrapy.Field()
        writers = scrapy.Field()
        stars = scrapy.Field()
        popularity = scrapy.Field()
        rating = scrapy.Field()
    
    # The spider file
    import scrapy
    from testimbd.items import TestimbdItem
    
    class ImbdsdpyderSpider(scrapy.Spider):
        name = 'imbdsdpyder'
        allowed_domains = ['imdb.com']
        start_urls = ['http://www.imdb.com/chart/top']
    
        def parse(self, response):
            for href in response.css("td.titleColumn a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_movie)
    
        def parse_movie(self, response):
            item = TestimbdItem()
            item['title'] = [ x.replace('\xa0', '')  for x in response.css(".title_wrapper h1::text").extract()][0]
            item['directors'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Director")]/following-sibling::a/text()').extract()
            item['writers'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Writers")]/following-sibling::a/text()').extract()
            item['stars'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Stars")]/following-sibling::a/text()').extract()
            item['popularity'] = response.css(".titleReviewBarSubItem span.subText::text")[2].re('([0-9]+)')
            item['rating'] = response.css(".ratingValue span::text").extract_first()
    
            yield item
    