Python: why does my Scrapy spider return no output when crawling Amazon?


I tested my Scrapy spider on an Amazon best-sellers page (see the URL below), but it returns strange price numbers, or no output at all, as you can see in the output at the end (I have only shared the output for one page). The CSS selectors may be wrong, but I am not sure. I want the spider to save its output to a JSON file so I can quickly convert it to a pandas DataFrame for analysis. This is the command I run in the terminal to start the spider: scrapy crawl amazon_booksUK -o somefilename.json

I know this is a lot to look through, but if you have the time, it would really help me out! :)

url=
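For the last step mentioned above (turning the exported JSON into a DataFrame), a minimal sketch of loading the feed might look like this. The sample record is illustrative, not real scraped data; `-o somefilename.json` makes Scrapy write a single JSON array of items:

```python
import json

# Scrapy's JSON feed export ("-o somefilename.json") is one JSON array of
# items, so the whole file loads in a single call. This inline sample stands
# in for json.load(open("somefilename.json")).
sample = '[{"product_name": ["Some Book"], "product_price": ["11"]}]'

items = json.loads(sample)
print(items[0]["product_name"][0])  # each field is a list, one entry per CSS match
# pandas.DataFrame(items) accepts this list of dicts directly.
```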

1. Spider code:

import scrapy
from ..items import AmazonscrapeItem

class AmazonSpiderSpider(scrapy.Spider):
    page_number = 2
    name = 'amazon_booksUK'
    start_urls = [
        'https://www.amazon.co.uk/s?i=stripbooks&bbn=266239&rh=n%3A266239%2Cp_72%3A184315031%2Cp_36%3A389028011&dc&page=1&fst=as%3Aoff&qid=1598942460&rnid=389022011&ref=sr_pg_1'
    ]

    def parse(self, response):
        items = AmazonscrapeItem()

        # if multiple classes --> .css("::text").extract()
        product_name = response.css('.a-color-base.a-text-normal::text').extract()
        product_author = response.css('.a-color-secondary .a-size-base.a-link-normal').css('::text').extract()
        product_nbr_reviews = response.css('.a-size-small .a-link-normal .a-size-base').css('::text').extract()
        product_type = response.css('.a-spacing-top-small .a-link-normal.a-text-bold').css('::text').extract()
        product_price = response.css('.a-spacing-top-small .a-price-whole').css('::text').extract()
        product_more_choice = response.css('.a-spacing-top-mini .a-color-secondary .a-link-normal').css('::text').extract()
        # this only selects the element that has the image --> need stuff inside src (source attr)
        product_imagelink = response.css('.s-image::attr(src)').extract() # want attr of src

        items['product_name'] = product_name
        items['product_author'] = product_author
        items['product_nbr_reviews'] = product_nbr_reviews
        items['product_type'] = product_type
        items['product_price'] = product_price
        items['product_more_choice'] = product_more_choice
        items['product_imagelink'] = product_imagelink

        yield items

        next_page = 'https://www.amazon.co.uk/s?i=stripbooks&bbn=266239&rh=n%3A266239%2Cp_72%3A184315031%2Cp_36%3A389028011&dc&page='+ str(AmazonSpiderSpider.page_number)+'&fst=as%3Aoff&qid=1598942460&rnid=389022011&ref=sr_pg_'+ str(AmazonSpiderSpider.page_number)
        if AmazonSpiderSpider.page_number <3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
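One thing worth noting about the parse() method above: it yields a single item per page whose fields are whole-page lists, so if any selector matches a different number of nodes, the columns silently fall out of alignment. A hedged sketch of pairing the page-level lists into one record per product (the helper name and usage are my own, not Scrapy API):

```python
def rows_from_columns(**columns):
    """Zip parallel page-level lists into one dict per product.

    Truncates to the shortest column, which at least keeps the
    remaining fields aligned with each other.
    """
    n = min(len(values) for values in columns.values())
    return [{name: values[i] for name, values in columns.items()}
            for i in range(n)]

rows = rows_from_columns(product_name=['Book A', 'Book B'],
                         product_price=['11.12', '9.99'])
# rows[0] == {'product_name': 'Book A', 'product_price': '11.12'}
```

In the spider this would mean yielding one item per entry of `rows`. The more idiomatic Scrapy approach is to iterate over per-result containers (something like `response.css('div.s-result-item')`, though that selector is an assumption about Amazon's markup) and query each field relative to that container.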
2. items.py:

import scrapy

class AmazonscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_author = scrapy.Field()
    product_nbr_reviews = scrapy.Field()
    product_type = scrapy.Field()
    product_price = scrapy.Field()
    product_more_choice = scrapy.Field()
    product_imagelink = scrapy.Field()
3. settings.py: I used the Googlebot user agent to avoid being banned while testing the scraper

USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
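One setting worth double-checking alongside the user agent: the project template enables ROBOTSTXT_OBEY (the overridden-settings log confirms it here), and when a site's robots.txt disallows a URL, Scrapy drops the request with only a DEBUG-level "Forbidden by robots.txt" line, which looks exactly like "no output". A settings.py fragment for local testing only (whether amazon.co.uk's robots.txt actually blocks these search URLs is an assumption to verify in your own log):

```python
# settings.py -- testing-only sketch
ROBOTSTXT_OBEY = False   # don't silently drop robots.txt-disallowed URLs while debugging
DOWNLOAD_DELAY = 1       # stay polite while testing
```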
4. pipelines.py:

class AmazonscrapePipeline:
    def process_item(self, item, spider):
        return item
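The pass-through pipeline above is a natural place to normalise items. As one hedged example (my own sketch, not anything Scrapy requires), a pipeline could drop fields that came back as empty lists so the JSON export stays tidy:

```python
class CleanEmptyFieldsPipeline:
    """Drop keys whose value is an empty list before export."""

    def process_item(self, item, spider):
        # Copy the keys first, since we delete from the item while iterating.
        for key in [k for k, v in dict(item).items() if v == []]:
            del item[key]
        return item

# Usage with a plain dict standing in for a scrapy.Item:
item = {'product_author': [], 'product_name': ['Some Book']}
cleaned = CleanEmptyFieldsPipeline().process_item(item, spider=None)
# cleaned == {'product_name': ['Some Book']}
```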
Output:
(venv) (base) Eriks-MBP:scrapyTutorial erikhren$ cd amazonscrape
(venv) (base) Eriks-MBP:amazonscrape erikhren$ scrapy crawl amazon_booksUK -o mm.json
2020-09-01 15:01:43 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: amazonscrape)
2020-09-01 15:01:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (default, Mar 6 2020, 22:34:30) - [Clang 11.0.3 (clang-1103.0.32.29)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g, 21 Apr 2020), cryptography 3.1, Platform Darwin-19.6.0-x86_64
2020-09-01 15:01:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-09-01 15:01:43 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'amazonscrape',
 'NEWSPIDER_MODULE': 'amazonscrape.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['amazonscrape.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
               '+http://www.google.com/bot.html)'}
2020-09-01 15:01:43 [scrapy.extensions.telnet] INFO: Telnet Password: e57aa4639df179b5
2020-09-01 15:01:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-01 15:01:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-01 15:01:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-01 15:01:44 [scrapy.middleware] INFO: Enabled item pipelines:
['amazonscrape.pipelines.AmazonscrapePipeline']
2020-09-01 15:01:44 [scrapy.core.engine] INFO: Spider opened
2020-09-01 15:01:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-01 15:01:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-01 15:01:44 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2020-09-01 15:01:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to
2020-09-01 15:01:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to
2020-09-01 15:01:45 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2020-09-01 15:01:45 [scrapy.core.scraper] DEBUG: Scraped from
{'product_author': [],
 'product_imagelink': [],
 'product_more_choice': [],
 'product_name': [],
 'product_nbr_reviews': [],
 'product_price': ['11',
'.',
'12',
'.',
'9',
'.',
'6',
'.',
'9',
'.',
'9',
'.',
'6',
'.',
'8',
'.',
'30',
'.',
'13',
'.',
'11',
'.',
'10',
'.',
'7',
'.',
'6',
'.',
'24',
'.',
'39',
'.',
'20',
'.',
'34',
'.',
'6',
'.',
'48',
'.',
'52',
'.',
'47',
'.',
'14',
'.',
'20',
'.',
'17',
'.',
'14',
'.',
'71',
'.',
'34',
'.',
'34',
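The alternating digits and '.' strings in product_price above suggest that `.a-price-whole::text` returns the whole-pound part and the trailing decimal point as separate text nodes, with the pence part living in a separate element (on Amazon that is typically `.a-price-fraction`, though that selector is an assumption here, not something confirmed by this output). A minimal post-processing sketch:

```python
def whole_pounds(fragments):
    """Drop the stray '.' text nodes, keeping one whole-pound value per price."""
    return [f for f in fragments if f != '.']

wholes = whole_pounds(['11', '.', '12', '.', '9', '.'])
# wholes == ['11', '12', '9']
# To recover full prices, also extract '.a-price-fraction::text' (the assumed
# selector above) and zip: [f'{w}.{fr}' for w, fr in zip(wholes, fractions)]
```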