Python: Scrapy stops crawling after a few pages
I'm just learning the basics of Scrapy and website crawling, so I would really appreciate your input. Following a tutorial, I built a simple, straightforward crawler with Scrapy. It works fine, but it doesn't crawl all the pages it should. My spider code is:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from fraist.items import FraistItem
import re

class fraistspider(BaseSpider):
    name = "fraistspider"
    allowed_domain = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//div[@class='pagination']/a/@href").extract()

        # We store already crawled links in this list
        crawledLinks = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if linkPattern.match(link) and not link in crawledLinks:
                crawledLinks.append(link)
                yield Request(link, self.parse)

        posts = hxs.select("//article[@class='content-summary']")
        items = []
        for post in posts:
            item = FraistItem()
            item["title"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.select("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.select("div[@class='summary']/p/text()").extract()
            items.append(item)
        for item in items:
            yield item
The output is:
'title': [u'Design a poster in the style of Saul Bass']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Scraped from <200 http://nnbdesigner.wpengine.com/designer-blog/>
{'content': [u'Helping a company come up with a branding strategy can be exciting\xa0and intimidating, all at once. It gives a designer the opportunity to make a great visual impact with a brand, but requires skills in logo, print and digital design. If you\u2019ve been hesitating to join a 99designs Brand Identity Pack contest, here are a... '],
 'link': [u'http://99designs.com/designer-blog/2015/05/07/tips-brand-identity-pack-design-success/'],
 'title': [u'99designs\u2019 tips for a successful Brand Identity Pack design']}
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/10/>
2015-05-20 16:22:41+0100 [fraistspider] DEBUG: Redirecting (301) to <GET http://nnbdesigner.wpengine.com/> from <GET http://99designs.com/designer-blog/page/11/>
2015-05-20 16:22:41+0100 [fraistspider] INFO: Closing spider (finished)
2015-05-20 16:22:41+0100 [fraistspider] INFO: Stored csv feed (100 items) in: data.csv
2015-05-20 16:22:41+0100 [fraistspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4425,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 126915,
'downloader/response_count': 16,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 5,
'dupefilter/filtered': 41,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 20, 15, 22, 41, 738000),
'item_scraped_count': 100,
'log_count/DEBUG': 119,
'log_count/INFO': 8,
'request_depth_max': 5,
'response_received_count': 11,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2015, 5, 20, 15, 22, 40, 718000)}
2015-05-20 16:22:41+0100 [fraistspider] INFO: Spider closed (finished)
As you can see, 'item_scraped_count' is 100, although it should be much higher, since there are 122 pages in total with 10 posts per page.
From the output I can see the 301 redirects, but I don't understand why they would cause a problem. I tried rewriting my spider another way, but after a few entries in the same section it breaks again.
Any help would be much appreciated. Thank you!

It seems that a default limit of 100 items defined in the settings is being reached (most likely the CLOSESPIDER_ITEMCOUNT setting, which stops the spider after that many items have been scraped).
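If that is the cause, a minimal sketch of the fix in the project's settings.py would look like this, assuming the cap really comes from CLOSESPIDER_ITEMCOUNT (that is a standard Scrapy setting, but whether it is the culprit here is an assumption):

# settings.py -- a minimal sketch. Assumption: the 100-item cap comes from
# CLOSESPIDER_ITEMCOUNT; setting it to 0 disables the item-count limit.
CLOSESPIDER_ITEMCOUNT = 0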
In this case I would use a CrawlSpider to crawl multiple pages, so you have to define a Rule with a LinkExtractor that matches the pagination pages on 99designs.com, and slightly modify your parse function to handle the items.
Edit: I just found that the Scrapy documentation contains a useful example, reproduced below.

From the comments: Thank you for your reply, it was very helpful. I've managed to rewrite my spider a bit, but unfortunately it also crawls other links, so for some reason it doesn't take my rules into account. I'm not entirely sure whether I've defined the links variable where I should have. Edit: in short, it goes through all internal links (site-wide) instead of just the links from the page navigation.

Hi Adrian. Yes, that's because you didn't define the rule correctly: in the example below, the rule matches pages containing category.php, but you put allow(''), which basically allows the scraper to visit any page on the site. Use a regular expression that matches 'page/\d+/' or something similar instead.

Copied and pasted from the Scrapy documentation:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
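Putting the answer and the comments together, the sketch below shows how the original spider might look as a CrawlSpider. This is an illustration, not the poster's actual code: the names FraistCrawlSpider and parse_post are invented, the XPath expressions are copied from the question's spider, and the pagination pattern is inferred from the /designer-blog/page/N/ URLs visible in the log.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from fraist.items import FraistItem

class FraistCrawlSpider(CrawlSpider):
    # Hypothetical rewrite of the question's spider as a CrawlSpider.
    name = "fraistcrawlspider"
    allowed_domains = ["99designs.com"]
    start_urls = ["http://99designs.com/designer-blog/"]

    rules = (
        # Follow only pagination links such as /designer-blog/page/2/ and
        # hand every page reached this way to parse_post.
        Rule(LinkExtractor(allow=(r'designer-blog/page/\d+/', )),
             callback='parse_post', follow=True),
    )

    def parse_post(self, response):
        # Same extraction logic as the original parse(); the manual link
        # bookkeeping is gone because the Rule above handles following pages.
        for post in response.xpath("//article[@class='content-summary']"):
            item = FraistItem()
            item["title"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/text()").extract()
            item["link"] = post.xpath("div[@class='summary']/h3[@class='entry-title']/a/@href").extract()
            item["content"] = post.xpath("div[@class='summary']/p/text()").extract()
            yield item

Note that CrawlSpider only applies the rule callbacks to pages it follows; the start_urls page itself is not passed to parse_post, so override parse_start_url if the posts on the first page should be scraped as well.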