Python: Scrapy stops crawling after a few pages
I'm just learning the basics of Scrapy and website crawling, so I would really appreciate your input. Following a tutorial, I built a simple, straightforward crawler with Scrapy. It works fine, but it doesn't crawl all the pages it should. My spider code is:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from fraist.items import FraistItem
import re

class fraistspider(BaseSpider):
    name = "fraistspider"
    allowed_domain = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@class='pagination']/a/@href").extract()

        # We store already crawled links in this list
        crawledLinks = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if linkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append(link)
                yield Request(link, self.parse)

        posts = hxs.select("//article[@class='content-summary']")
        items = []
        for post in posts:
            item = FraistItem()
            item["title"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.select("div[@class='summary']/p/text()").extract()
            items.append(item)
        for item in items:
            yield item
The output is:
'title': [u'Design a poster in the style of Saul Bass']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Scraped from <200 http://nnbdesigner.wpengine.com/designer-blog/>
{'content': [u'Helping a company come up with a branding strategy can be exciting\xa0and intimidating, all at once. It gives a designer the opportunity to make a great visual impact with a brand, but requires skills in logo, print and digital design. If you\u2019ve been hesitating to join a 99designs Brand Identity Pack contest, here are a... '],
 'link': [u'http://99designs.com/designer-blog/2015/05/07/tips-brand-identity-pack-design-success/'],
 'title': [u'99designs\u2019 tips for a successful Brand Identity Pack design']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/10/>
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/11/>
2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)
2015-05-20 16:22:41+0100 [fraistspider] INFO: Stored csv feed (100 items) in: data.csv
2015-05-20 16:22:41+0100 [fraistspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4425,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 126915,
'downloader/response_count': 16,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 5,
'dupefilter/filtered': 41,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 20, 15, 22, 41, 738000),
'item_scraped_count': 100,
'log_count/DEBUG': 119,
'log_count/INFO': 8,
'request_depth_max': 5,
'response_received_count': 11,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2015, 5, 20, 15, 22, 40, 718000)}
2015-05-20 16:22:41+0100 [fraistspider] INFO: Spider closed (finished)
As you can see, 'item_scraped_count' is 100, although it should be much higher, since there are 122 pages in total with 10 posts per page.
From the output I can see the 301 redirects, but I don't understand why they would cause a problem. I tried rewriting my spider another way, but after a few entries in the same section it breaks again.
Any help would be much appreciated. Thank you!

It seems that a default limit of 100 items defined in the settings is being reached (most likely the CLOSESPIDER_ITEMCOUNT setting, which stops the spider after that many items have been scraped).
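If that is the cause, a minimal sketch of the fix in the project's settings.py would look like this, assuming the cap really comes from CLOSESPIDER_ITEMCOUNT (that is a standard Scrapy setting, but whether it is the culprit here is an assumption):

# settings.py -- a minimal sketch. Assumption: the 100-item cap comes from
# CLOSESPIDER_ITEMCOUNT; setting it to 0 disables the item-count limit.
CLOSESPIDER_ITEMCOUNT = 0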
In this case I would use a CrawlSpider to crawl multiple pages, so you have to define a Rule with a LinkExtractor that matches the pagination pages on 99designs.com, and slightly modify your parse function to handle the items.
Edit: I just found that the Scrapy documentation contains a useful example, reproduced below.

From the comments: Thank you for your reply, it was very helpful. I've managed to rewrite my spider a bit, but unfortunately it also crawls other links, so for some reason it doesn't take my rules into account. I'm not entirely sure whether I've defined the links variable where I should have. Edit: in short, it goes through all internal links (site-wide) instead of just the links from the page navigation.

Hi Adrian. Yes, that's because you didn't define the rule correctly: in the example below, the rule matches pages containing category.php, but you put allow(''), which basically allows the scraper to visit any page on the site. Use a regular expression that matches 'page/\d+/' or something similar instead.

Copied and pasted from the Scrapy documentation:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
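Putting the answer and the comments together, the sketch below shows how the original spider might look as a CrawlSpider. This is an illustration, not the poster's actual code: the names FraistCrawlSpider and parse_post are invented, the XPath expressions are copied from the question's spider, and the pagination pattern is inferred from the /designer-blog/page/N/ URLs visible in the log.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from fraist.items import FraistItem

class FraistCrawlSpider(CrawlSpider):
    # Hypothetical rewrite of the question's spider as a CrawlSpider.
    name = "fraistcrawlspider"
    allowed_domains = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    rules = (
        # Follow only pagination links such as /designer-blog/page/2/ and
        # hand every page reached this way to parse_post.
        Rule(LinkExtractor(allow=(r'designer-blog/page/\d+/', )),
             callback='parse_post', follow=True),
    )

    def parse_post(self, response):
        # Same extraction logic as the original parse(); the manual link
        # bookkeeping is gone because the Rule above handles following pages.
        for post in response.xpath("//article[@class='content-summary']"):
            item = FraistItem()
            item["title"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.xpath("div[@class='summary']/p/text()").extract()
            yield item

Note that CrawlSpider only applies the rule callbacks to pages it follows; the start_urls page itself is not passed to parse_post, so override parse_start_url if the posts on the first page should be scraped as well.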