Web scraping: how do I get all the URLs from this page?


I'm working on a scraping project to collect all the URLs from this page and the following pages, but when I run the spider I only get one URL from each page! I wrote a for loop to collect them, but nothing changed. I also need each advert's data to end up as one row in a CSV file — how can I do that?

Spider code:

import datetime
import urlparse
import socket
import re

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from cars2buy.items import Cars2BuyItem

class Cars2buyCarleasingSpider(CrawlSpider):
    name = "cars2buy-carleasing"
    start_urls = ['http://www.cars2buy.co.uk/business-car-leasing/']

    rules = (
        Rule(LinkExtractor(allow=("Abarth"), restrict_xpaths='//*[@id="content"]/div[7]/div[2]/div/a')),
        Rule(LinkExtractor(allow=("695C"), restrict_xpaths='//*[@id="content"]/div/div/p/a'),  callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//*[@class="next"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for l in response.xpath('//*[@class="viewthiscar"]/@href'):
            item=Cars2BuyItem()
            item['Company']= l.extract()
            item['url']= response.url
            return item 
The output is:

> 2017-04-27 20:22:39 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.fleetprices.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-204097572&broker=178&veh_id=901651523&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr',  'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/'}
> 2017-04-27 20:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> (referer: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/)
> 2017-04-27 20:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.jgleasing.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-207378762&broker=248&veh_id=902250527&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr',  'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2'}
> 2017-04-27 20:22:40 [scrapy.core.engine] INFO: Closing spider
> (finished)

The problem is that as soon as the for loop returns the first item, it exits the parse_item method, so none of the remaining items are processed.
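This is ordinary Python semantics rather than anything Scrapy-specific: a `return` inside a loop ends the whole function on the first iteration, while `yield` turns the function into a generator that produces every item. A minimal sketch (the example paths are hypothetical stand-ins for the scraped hrefs):

```python
def with_return(items):
    # return exits the function on the first pass through the loop
    for i in items:
        return i

def with_yield(items):
    # yield makes this a generator: every item in the loop is produced
    for i in items:
        yield i

rows = ['/car-1', '/car-2', '/car-3']
print(with_return(rows))       # → '/car-1' (only the first item)
print(list(with_yield(rows)))  # → ['/car-1', '/car-2', '/car-3']
```

Scrapy iterates over whatever a callback yields, which is why the generator version scrapes every matching element on the page.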

I suggest you replace the return with a yield:

def parse_item(self, response):
    for l in response.xpath('//*[@class="viewthiscar"]/@href'):
        item=Cars2BuyItem()
        item['Company']= l.extract()
        item['url']= response.url
        yield item
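As for getting each advert onto its own row of a CSV file: once every item is yielded, Scrapy's built-in feed export writes one row per item, so no extra code should be needed. Something like the following invocation (using the spider name from your code) ought to work:

```shell
# Export every yielded item as one CSV row, with item fields as columns
scrapy crawl cars2buy-carleasing -o output.csv
```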