Scrapy spider does not scrape page 1

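I want my spider to scrape the listings on every page of a website. I am using CrawlSpider with a LinkExtractor. But when I look at the CSV file, nothing from the first page (i.e. the start URL) has been scraped. Items are scraped only from page 2 onward. I tested my XPath expressions in the Scrapy shell and they look fine. I can't figure out where the problem is. Below is my spider code. Please help. Thanks a lot!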

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from shputuo.items_shputuo import ShputuoItem


class Shputuo(CrawlSpider):
    name = "shputuo"

    allowed_domains = ["shpt.gov.cn"] # DO NOT use www in allowed domains
    start_urls =  ["http://www.shpt.gov.cn/gb/n6132/n6134/n6156/n7110/n7120/index.html"] 

    rules = (
        Rule(
            LinkExtractor(allow=(), restrict_xpaths=("//div[@class = 'page']/ul/li[5]/a",)),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        for sel in response.xpath("//div[@class = 'neirong']/ul/li"):
            item = ShputuoItem()
            word = sel.xpath("a/text()").extract()[0]
            item['id'] = word[3:11]
            item['title'] = word[11:len(word)]
            item['link'] = "http://www.shpt.gov.cn" + sel.xpath("a/@href").extract()[0]
            item['time2'] = sel.xpath("span/text()").extract()[0][1:11]

            request = scrapy.Request(item['link'], callback = self.parse_content)
            request.meta['item'] = item            

            yield request

    def parse_content(self, response):
        item = response.meta['item']
        item['question'] = response.xpath("//div[@id = 'ivs_content']/p[2]/text()").extract()[0]
        item['question'] = "".join(map(unicode.strip, item['question'])) # get rid of unwanted spaces and other whitespace
        item['reply'] =  response.xpath("//div[@id = 'ivs_content']/p[3]/text()").extract()[0]
        item['reply'] = "".join(map(unicode.strip, item['reply']))
        item['agency'] = item['reply'][6:10]
        item['time1'] = "2015-" + item['question'][0] + "-" + item['question'][2]


        yield item

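It looks like what you really need is to also parse the responses of the start_urls requests themselves, not just the pages reached by following the rules. A CrawlSpider applies the rule callback only to links it extracts, so the start URL is used for link extraction but never passed to parse_items.

To do that, use the parse_start_url method, which is the default callback for the start_urls requests.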

Is the URL in start_urls the URL of page 1?

@eLRuLL, yes. It is the first page.