Python Scrapy is throwing an error on the start URL

Tags: python, mongodb, web-scraping, scrapy

I am trying to build a web crawler that scrapes data from Flipkart, using MongoDB to store the results. My code is as follows:

WebSpider.py

from scrapy.spider import CrawlSpider
from scrapy.selector import Selector
from spider_web.items import SpiderWebItem

class WebSpider(CrawlSpider):
    name = "spider_web"
    allowed_domains = ["http://www.flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/search?q=amish+tripathi",
    ]

    def parse(self, response):
        books = response.selector.xpath(
            '//div[@class="old-grid"]/div[@class="gd-row browse-grid-row"]')

        for book in books:
            item = SpiderWebItem()

            item['title'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-title fk-font-13"]/a[contains(@href, "from-search")]/@title').extract()[0].strip()

            item['rating'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]

            item['noOfRatings'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/text()').extract()[1].strip()

            item['url'] = response.url

            yield item
items.py

 from scrapy.item import Item, Field

 class SpiderWebItem(Item):
     url = Field()
     title = Field()
     rating = Field()
     noOfRatings = Field()
pipelines.py

 import pymongo

 from scrapy.conf import settings
 from scrapy.exceptions import DropItem
 from scrapy import log


 class MongoDBPipeline(object):

     def __init__(self):
         connection = pymongo.MongoClient(
             settings['MONGODB_SERVER'],
             settings['MONGODB_PORT']
         )
         db = connection[settings['MONGODB_DB']]
         self.collection = db[settings['MONGODB_COLLECTION']]

     def process_item(self, item, spider):
         for data in item:
             if not data:
                 raise DropItem("Missing data!")
         self.collection.update({'title': item['title']}, dict(item), upsert=True)
         log.msg("book added to MongoDB database!",
                 level=log.DEBUG, spider=spider)
         return item
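One thing worth noting about the pipeline above: `for data in item:` iterates over the item's *keys*, so `if not data` tests the field names rather than their values and will never drop anything. A value-level check would look like this (a pure-Python sketch; `ValueError` stands in for Scrapy's `DropItem` so it runs outside a crawl):

```python
def validate_item(item):
    """Raise if any field of the item dict has an empty or missing value.

    Sketch of the check process_item appears to intend; in a real
    pipeline you would raise DropItem instead of ValueError.
    """
    missing = [key for key, value in item.items() if not value]
    if missing:
        raise ValueError("Missing data in: %s" % ", ".join(missing))
    return item
```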
settings.py

 BOT_NAME = 'spider_web'

 SPIDER_MODULES = ['spider_web.spiders']
 NEWSPIDER_MODULE = 'spider_web.spiders'
 DOWNLOAD_HANDLERS = {
      's3': None,
 }
 DOWNLOAD_DELAY = 0.25
 DEPTH_PRIORITY = 1
 SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
 SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
 ITEM_PIPELINES = ['spider_web.pipelines.MongoDBPipeline', ]

 MONGODB_SERVER = "localhost"
 MONGODB_PORT = 27017
 MONGODB_DB = "flipkart"
 MONGODB_COLLECTION = "books"
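A side note on the pipeline registration: the list form of `ITEM_PIPELINES` used above was later deprecated, and newer Scrapy releases expect a dict mapping each pipeline path to an order value between 0 and 1000. A hedged equivalent for this project would be:

```python
# Dict form of ITEM_PIPELINES expected by newer Scrapy versions;
# the value controls the order pipelines run in (lower runs first).
ITEM_PIPELINES = {
    'spider_web.pipelines.MongoDBPipeline': 300,
}
```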
I checked every XPath in the scrapy shell, and they all produce correct results. But the start URL is throwing an error when the spider runs:

2015-10-05 20:05:10 [scrapy] ERROR: Spider error processing <GET http://www.flipkart.com/search?q=rabindranath+tagore> (
referer: None)

........

  File "F:\myP\Web Scraping\spider_web\spider_web\spiders\WebSpider.py", line 21, in parse
    './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]
IndexError: list index out of range
I am at my wits' end here. The spider fetches data for one or two items, then raises this error and stops. Any help would be appreciated. Thanks in advance.

Some books have no rating, and you need to handle those too. For example:

try:
    item['rating'] = book.xpath('.//div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]
except IndexError:
    item['rating'] = 'no rating'
However, I would actually consider using input and output processors and letting them handle these cases.

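The processor suggestion boils down to "take the first extracted value, or fall back to a default when the selector matched nothing." A minimal pure-Python sketch of that output-processor logic (the function name is my own, not a Scrapy API; Scrapy's built-in `TakeFirst` does the first-value part without the default):

```python
def take_first_or_default(values, default='no rating'):
    """Return the first non-empty value from an extracted list,
    or a default when the list is empty (e.g. a book with no
    pu-rating block). Mirrors a TakeFirst-style output processor
    extended with a fallback value.
    """
    for value in values:
        if value is not None and value != '':
            return value
    return default
```

Attached to the `rating` field via an ItemLoader, this would replace the try/except around every `extract()[0]` call in the spider.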