Web scraping: Scrapy not processing IMDb keyword pages


Here is how I intend this code to work: I have a keyword, say "gadgets". I search for titles on the IMDb advanced search page. I want the code to go to each title page, then to each title's keywords page, and download the title along with all of its keywords. The code structure looks fine to me, but it does not actually work. Please advise whether it needs to be rewritten, or whether it can be corrected with a few suggestions.

Here is my spider:

import scrapy

class KwordsSpider(scrapy.Spider):
    name= 'ImdbSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/search/title/?keywords=gadgets'
    ]    
    def parse(self, response):
        titleLinks = response.xpath('//*[@class="lister-item-content"]')

        for link in titleLinks:
            title_url = 'https://www.imdb.com'+link.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(title_url, callback=self.parse_title)
        next_page_url = 'https://www.imdb.com'+response.xpath('//div[@class="article"]/div[@class="desc"]/a[@href]').extract_first()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(next_page_url, callback=self.parse) 

    def parse_title(self, response):
        keywords_url = 'https://www.imdb.com' + response.xpath('//nobr/a[@href]').extract_first()

        yield scrapy.Request(keywords_url, callback=self.parse_keys)
    #looking at the keywords page
    def parse_keys(self, response):
        title = response.xpath('//h3/a/text()').extract_first()
        keys = response.xpath('//div[@class="sodatext"]/a/text()').extract()
        print('my print'+title)    
        yield{
            'title': title,
            'Keywords': keys,
        }
Here are some of the PowerShell console lines:

2020-05-02 08:33:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-02 08:33:40 [scrapy.core.engine] INFO: Spider opened
2020-05-02 08:33:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-02 08:33:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-02 08:33:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/search/title/?keywords=gadgets> (referer: None)
2020-05-02 08:33:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.imdb.com<a href="': <GET https://www.imdb.com<a href="/search/title/?keywords=gadgets&amp;start=51%22%20class=%22lister-page-next%20next-page%22%3ENext%20%C2%BB%3C/a%3E>
2020-05-02 08:33:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt3896198/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0369171/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt1149317/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] INFO: Closing spider (finished)
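The "Filtered offsite request" line in the log points at the root cause: the next-page XPath `//div[@class="article"]/div[@class="desc"]/a[@href]` selects the whole `<a>` element, so `extract_first()` returns its serialized HTML, and concatenating that onto `'https://www.imdb.com'` produces the mangled URL in the log. Selecting the attribute itself (`a/@href`) and resolving it with `response.urljoin()` avoids both problems; `response.urljoin()` behaves like the standard library's `urljoin`:

```python
from urllib.parse import urljoin

base = 'https://www.imdb.com/search/title/?keywords=gadgets'

# An href extracted with a/@href is a site-relative path; urljoin
# resolves it against the current page URL, as response.urljoin() does.
print(urljoin(base, '/title/tt3896198/'))
# https://www.imdb.com/title/tt3896198/

print(urljoin(base, '/search/title/?keywords=gadgets&start=51'))
# https://www.imdb.com/search/title/?keywords=gadgets&start=51
```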

There are a few XPath errors in the script. I have fixed them; it should work now:

import scrapy

class KwordsSpider(scrapy.Spider):
    name = 'ImdbSpider'
    start_urls = [
        'https://www.imdb.com/search/title/?keywords=gadgets'
    ]    
    def parse(self, response):
        titleLinks = response.xpath('//*[@class="lister-item-content"]')

        for link in titleLinks:
            title_url = response.urljoin(link.xpath('.//h3/a/@href').get())
            yield scrapy.Request(title_url, callback=self.parse_title)

        next_page_url = response.xpath('//div[@class="article"]/div[@class="desc"]/a/@href').get()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse) 

    def parse_title(self, response):
        keywords_url = response.urljoin(response.xpath('//nobr/a/@href').get())
        yield scrapy.Request(keywords_url, callback=self.parse_keys)

    def parse_keys(self, response):
        title = response.xpath('//h3/a/text()').get()
        keys = response.xpath('//div[@class="sodatext"]/a/text()').getall()
        yield {
            'title': title,
            'Keywords': keys,
        }

One thing I am still confused about: I tried to include `title_url` in the yield, as in `yield {'title': title, 'Link': title_url, 'Keywords': keys}`, but it does not work. How can this be done?

I am not sure I understand your question. I can see `title_url` in the `parse()` method. Do you want to output it in the `parse_keys()` method? Could you clarify?

Yes, that is what I want to know.

I used `meta`, and with it I was able to pass the variable from one method to the next.