Python LinkedExtractor to go to the next page isn't working

Tags: python, scrapy, web-crawler, scrapy-spider

Below is the code I have so far, trying to crawl more than one page of a website... I'm having a hard time getting the Rule class to work properly. What am I doing wrong?

#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

#    def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            #handle output (print or save to database)
            items.append(item)
            print item ["title"],item["leeftijd"],item["prijs"],item["km"]

A few things to change:

  • when using a CrawlSpider, you should not override parse: that method is where the "magic" of this special spider type happens. As the Scrapy docs put it: "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
  • as mentioned in the comments, your XPath needs fixing by removing the extra
    /a
    at the end (an <a> inside a link will not match any element)
  • CrawlSpider
    rules need a callback method if you want to extract items from the followed pages
  • to also parse elements from the start URLs, you need to define a parse_start_url method

Here is a minimal
CrawlSpider
that follows the 3 pages from your sample input and prints out how many "articles" each page contains:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page
Output:

$ scrapy runspider 001.py 
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines: 
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=3&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
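
This minimal spider only counts articles; to actually collect data, parse_page can build and yield the SkodaItem objects from the question instead of printing them. Here is a rough sketch of such a callback, assuming the SkodaItem fields from the original code and a from tutorial.items import SkodaItem at the top of the file; the relative XPaths (note the leading ".//") are illustrative guesses at the page structure and may need adjusting against the real HTML:

    def parse_page(self, response):
        for sel in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # selectors starting with ".//" are evaluated relative to the
            # current <article>, so there is no need to rebuild an absolute
            # XPath with a running counter as in the original code
            title = sel.xpath('.//h2/a/span/text()').extract()
            item["title"] = title[0].strip() if title else None
            prijs = sel.xpath('.//div[@class="price"]/text()').extract()
            item["prijs"] = prijs[0].strip() if prijs else None
            yield item

Yielding items (rather than printing them) also means Scrapy's feed exports can write the results out directly, e.g.:

$ scrapy runspider 001.py -o skodas.json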
