Python: Simple Scrapy crawler not following links & scraping


python web-crawler scrapy


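Basically, the problem is with following links.

I start from pages 1, 2, 3, 4, 5, and so on, 90 pages in total.

Each page has about 100 links.

Every page URL has this format: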

http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4

This is the regex matching rule:

Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
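A quick way to check the pattern on its own, outside Scrapy, is to run it against a few sample URLs. This is only a sketch: it assumes the allow pattern is applied as a regex search against the absolute URL, and the helper name matches_listing_page is made up for illustration. It is written for Python 3.

```python
import re

# Pattern copied from the Rule above; a raw string keeps the backslashes literal.
PAGE_PATTERN = r'(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'


def matches_listing_page(url):
    """Hypothetical helper: True if the URL contains a paginated listing address."""
    return re.search(PAGE_PATTERN, url) is not None


urls = [
    "http://www.consumercomplaints.in/lastcompanieslist/page/1",
    "http://www.consumercomplaints.in/lastcompanieslist/page/90",
    "http://www.consumercomplaints.in/lastcompanieslist/",  # start URL, no page number
]

results = [matches_listing_page(u) for u in urls]
print(results)  # [True, True, False]
```

The paginated URLs match and the bare start URL does not, so the rule itself does not look like the reason the crawl stops early.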

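I go to each page and then create a Request object to scrape all the links on that page.

Scrapy only crawls 179 links each time and then reports a finished status.

What am I doing wrong?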

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse

class consumercomplaints_spider(CrawlSpider):
    name = "test_complaints"
    allowed_domains = ["www.consumercomplaints.in"]
    protocol='http://'

    start_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/"
    ]

    #These are the rules for matching the domain links using a regularexpression, only matched links are crawled
    rules = [
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
    ]


    def parse_data(self, response):
        #Get All the links in the page using xpath selector
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        #Convert each Relative page link to Absolute page link -> /abc.html -> www.domain.com/abc.html and then send Request object
        for relative_link in all_page_links:
            print "relative link procesed:"+relative_link

            absolute_link = urlparse.urljoin(self.protocol+self.allowed_domains[0],relative_link.strip())
            request = scrapy.Request(absolute_link,
                         callback=self.parse_complaint_page)
            return request


        return {}

    def parse_complaint_page(self,response):
        print "SCRAPED"+response.url
        return {}

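You need to use yield instead of return.

For each new Request object, use

yield request

instead of

return request

Because return exits parse_data on the first iteration of the loop, only one Request per listing page is ever scheduled.

See more about yield and how it differs from return.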
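The difference can be seen without Scrapy at all. The sketch below is plain Python 3 (where urljoin lives in urllib.parse rather than the urlparse module used in the question). The function names and the sample links are invented for illustration; in the real spider the loop body would build a scrapy.Request with callback=self.parse_complaint_page and yield it.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.consumercomplaints.in"


def links_with_return(relative_links):
    """Mimics the question's loop: return leaves the function on the first item."""
    for link in relative_links:
        return urljoin(BASE_URL, link.strip())


def links_with_yield(relative_links):
    """Mimics the suggested fix: yield hands back every item, one at a time."""
    for link in relative_links:
        yield urljoin(BASE_URL, link.strip())


# Hypothetical relative links standing in for the hrefs extracted by the xpath.
sample = ["/a.html", " /b.html ", "/c.html"]

first_only = links_with_return(sample)
all_links = list(links_with_yield(sample))

print(first_only)      # only the first link
print(len(all_links))  # every link in the sample
```

With return, each listing page contributes a single complaint link, which would be consistent with roughly 90 listing pages plus about as many complaint pages adding up to the 179 reported in the question.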

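Sorry, I didn't get it. Do you need to crawl 90 links? What are the 179 pages?

@Nabin I edited the question, sorry. I need to follow 90 pages, and each page has 100 links to scrape. Scrapy only scrapes 179 links in total.

Are you sure all 100 links on each page are in the same domain?

Yes, I'm sure. You can check the page template by appending the page number to the end of the URL, like this, and you will see the big list of links I am trying to crawl. I use an xpath selector to get the links. The pasted code works. Try running the code directly to check if needed.

I would like to see you use yield instead of return first.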