Python: there is a problem in the pipeline file because it cannot get the book name; instead it saves one random image named None.jpg for every crawl

Tags: python, scrapy, filenames

My items.py file. As far as I know, the image_urls and images fields are not causing any problems:

import scrapy
from scrapy.loader.processors import TakeFirst

class BooksToScrapeItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    book_name = scrapy.Field(
        output_processor = TakeFirst()
    )
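For context, Scrapy's TakeFirst output processor simply returns the first collected value that is neither None nor an empty string. A minimal sketch of that behavior using only the standard library (a hypothetical take_first helper for illustration, not Scrapy's actual code):

```python
# Minimal sketch of what Scrapy's TakeFirst output processor does:
# return the first collected value that is neither None nor an empty string.
# (Hypothetical re-implementation for illustration only.)
def take_first(values):
    for value in values:
        if value is not None and value != "":
            return value
    return None

print(take_first(["", None, "A Light in the Attic"]))  # → A Light in the Attic
```

This is why book_name ends up as a single string rather than a list when the item is loaded.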
My pipelines.py file. I think there must be a problem in the get_media_requests method, because it does not get the book name from the items file:

from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request


class BooksToScrapeImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [Request(x, meta={'bookname': item.get('book_name')}) for x in item.get(self.images_urls_field, [])]  # I think the problem is in this line

    def file_path(self, request, response=None, info=None):

        return 'full/%s.jpg' % (request.meta['bookname'])
My spider file, which I use for scraping. It worked when I did not have a custom pipeline file:

import scrapy
from scrapy.loader import ItemLoader
from books_to_scrape.items import BooksToScrapeItem

class ImgscrapeSpider(scrapy.Spider):
    name = 'imgscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for article in response.xpath("//article[@class='product_pod']"):
            loader = ItemLoader(item=BooksToScrapeItem(),selector=article)
            relative_url = article.xpath(".//div/a/img[@class='thumbnail']/@src").extract_first()
            abs_url = response.urljoin(relative_url)
            loader.add_value('image_urls',abs_url)
            loader.add_xpath('book_name',".//article[@class='product_pod']/h3/a/text()") 
            yield loader.load_item() 
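The response.urljoin(relative_url) call above resolves the image's relative src against the page URL. The same resolution can be sketched with the standard library's urllib.parse.urljoin (the relative path below is a made-up example, not a real src from the site):

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com"
relative = "media/cache/example.jpg"  # made-up relative src, for illustration

# Resolve the relative path against the page URL, as response.urljoin() does
print(urljoin(base, relative))  # → http://books.toscrape.com/media/cache/example.jpg
```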

Your problem is in the relative xpath

loader.add_xpath('book_name', ".//article[@class='product_pod']/h3/a/text()") 

The loader uses xpath("//article[@class='product_pod']") as its selector:

   for article in response.xpath("//article[@class='product_pod']"):
        loader = ItemLoader(item=BooksToScrapeItem(), selector=article)

so all relative xpaths are already relative to "//article[@class='product_pod']", and they must not repeat "//article[@class='product_pod']".

With the relative xpath ".//article[@class='product_pod']/h3/a/text()" it cannot find the title, so book_name is empty for every item; it uses None as the title for every item, and therefore saves all images under the same name, None.jpg.

It must be

loader.add_xpath('book_name', ".//h3/a/text()")  # title truncated with `...`

BTW: text() does not give the full title, but one shortened with `...`. To get the full title you have to read the title= attribute:

loader.add_xpath('book_name', ".//h3/a/@title")  # full title
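The effect of repeating the article step inside a relative xpath can be reproduced with the standard library alone; the snippet below uses xml.etree.ElementTree on a made-up HTML fragment mimicking the books.toscrape.com markup, instead of Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the books.toscrape.com article markup
html = """<html><body>
<article class="product_pod"><h3><a title="A Light in the Attic">A Light in the ...</a></h3></article>
<article class="product_pod"><h3><a title="Tipping the Velvet">Tipping the ...</a></h3></article>
</body></html>"""

root = ET.fromstring(html)

for article in root.findall(".//article[@class='product_pod']"):
    # Repeating the article step relative to the article itself matches nothing
    wrong = article.findall(".//article[@class='product_pod']/h3/a")
    # A path relative to the current article does match
    right = article.findall(".//h3/a")
    print(len(wrong), right[0].get("title"))
```

Each iteration prints 0 for the repeated path, while the properly relative path finds the anchor and its full title attribute.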

I created a version with all the code in one file, so it can be run without creating a project.

Anyone can copy it into a single file and run it to test.

import scrapy
from scrapy.loader.processors import TakeFirst

class BooksToScrapeItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    book_name = scrapy.Field(
        output_processor = TakeFirst()
    )

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class BooksToScrapeImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [Request(x, meta={'bookname': item.get('book_name')}) for x in item.get(self.images_urls_field, [])]  # I think the problem is in this line

    def file_path(self, request, response=None, info=None):
        return 'full/%s.jpg' % request.meta['bookname']

from scrapy.loader import ItemLoader

class ImgscrapeSpider(scrapy.Spider):
    name = 'imgscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for article in response.xpath("//article[@class='product_pod']"):

            loader = ItemLoader(item=BooksToScrapeItem(),selector=article)

            relative_url = article.xpath(".//div/a/img[@class='thumbnail']/@src").extract_first()
            abs_url = response.urljoin(relative_url)

            loader.add_value('image_urls', abs_url)
            #loader.add_xpath('book_name',".//article[@class='product_pod']/h3/a/text()")  # wrong relative xpath 
            #loader.add_xpath('book_name', ".//h3/a/text()")  # only partial title
            loader.add_xpath('book_name', ".//h3/a/@title")  # full title

            yield loader.load_item() 

# -----------------------------------------------------------------------------

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',

    # download images to `IMAGES_STORE/full` (standard folder) and convert to JPG (even if it is already JPG)
    # it needs `yield {'image_urls': [url]}` in `parse()` and both ITEM_PIPELINES and IMAGES_STORE to work

    'ITEM_PIPELINES': {'__main__.BooksToScrapeImagePipeline': 1},  # use the pipeline defined in the current file (hence '__main__')
    'IMAGES_STORE': '.',                   # this folder has to exist before downloading

})
c.crawl(ImgscrapeSpider)
c.start()
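One further caveat, unrelated to the original bug: book titles can contain characters that are awkward or invalid in file names (quotes, slashes, colons). If titles are used directly in file_path(), it may be worth sanitizing them first. A hedged sketch with a hypothetical safe_filename helper:

```python
import re

# Hypothetical helper: replace anything outside a conservative whitelist
# (word characters, whitespace, dots, hyphens) with '_', and fall back to a
# default when the name is missing entirely.
def safe_filename(name, default="unnamed"):
    cleaned = re.sub(r"[^\w\s.-]", "_", name or default).strip()
    return cleaned or default

print(safe_filename("It's Only the Himalayas"))  # → It_s Only the Himalayas
print(safe_filename(None))                       # → unnamed
```

Inside file_path() this could be used as 'full/%s.jpg' % safe_filename(request.meta['bookname']).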