Pagination scraping with Scrapy and Python
I'm trying to scrape the information from all the entries and available content on a website, as a way to learn Scrapy. So far I have been able to scrape all the blog entries on one page, follow each entry, and scrape its content. I have also located the link to the next page. However, even though I have read many tutorials and looked at example code, I don't know how to proceed from there. This is what I have so far:
    import logging

    from scrapy import Request
    from scrapy.crawler import CrawlerProcess
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class SaltandLavender(CrawlSpider):
        logging.getLogger('scrapy').propagate = False
        name = 'saltandlavender'
        allowed_domains = ['saltandlavender.com']
        start_urls = ['https://www.saltandlavender.com/category/recipes/']
        rules = (
            Rule(LinkExtractor(allow='https://www.saltandlavender.com/category/recipes/'), callback="parse", follow=True),
        )

        def parse(self, response):
            #with open('page.html', 'wb') as html_file:
            #    html_file.write(response.body)
            print "start 1"
            for href in response.css('.entry-title a'):
                print "middle 1"
                yield response.follow(href, callback=self.process_page)
            next = response.css('li.pagination-next a::text')
            if next:
                url = ''.join(response.css('li.pagination-next a::attr(href)').extract())
                print url
                Request(url)

        def process_page(self, response):
            print "start 2"
            post_images = response.css('div.entry-content img::attr(src)').extract()
            content = {
                'cuisine': ''.join(response.xpath(".//span[@class='wprm-recipe-cuisine']/descendant::text()").extract()),
                'title': ''.join(response.css('article.format-standard h1.entry-title::text').extract()),
                #'content': response.xpath(".//div[@class='entry-content']/descendant::text()").extract(),
                'ingredients': ''.join(response.css('div.wprm-recipe-ingredients-container div.wprm-recipe-ingredient-group').extract()),
                #'time': response.css('wprm-recipe-total-time-container'),
                'servings': ''.join(response.css('span.wprm-recipe-servings::text').extract()),
                'course': ''.join(response.css('span.wprm-recipe-course::text').extract()),
                'preparation': ''.join(response.css('span.wprm-recipe-servings-name::text').extract()),
                'url': ''.join(response.url),
                'postimage': ''.join(post_images[1])
            }
            #print content
            print "end 2"

        def errorCatch(self):
            print "Script encountered an error. Check selectors for changes in the site's layout and design..."
            return

        def updateValid(self):
            return


    if __name__ == "__main__":
        LOG_ENABLED = False
        process = CrawlerProcess({
            #random.choice(useragent)
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
        process.crawl(SaltandLavender)
        process.start()
There is a problem with your next-page request. For one thing, you use `next` as a variable name, which shadows the built-in, and you never yield the next request. Check this fix:
    def parse(self, response):
        for href in response.css('.entry-title a'):
            yield response.follow(href, callback=self.process_page)
        next_page = response.css('li.pagination-next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)
You need to yield the request, not just create an instance of it. Replace:

    Request(url)

with:

    yield Request(url)
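To see why this matters without running a crawl: a Scrapy callback is a generator, and the engine only schedules what the callback yields. A `Request` that is constructed but never yielded is simply discarded. Here is a minimal pure-Python sketch of that behavior, where `Request` is a hypothetical stand-in class, not Scrapy's real one:

```python
class Request:
    """Hypothetical stand-in for scrapy.Request, for illustration only."""
    def __init__(self, url):
        self.url = url

def parse_broken(urls):
    """Builds Request objects but never yields them (no yield at all,
    so this is a plain function that returns None)."""
    for url in urls:
        Request(url)  # created and immediately discarded

def parse_fixed(urls):
    """Yields each Request, so the caller (the scheduler) receives it."""
    for url in urls:
        yield Request(url)

urls = ["https://example.com/page/2", "https://example.com/page/3"]
print(parse_broken(urls))                  # None: nothing reaches the scheduler
print([r.url for r in parse_fixed(urls)])  # both URLs get scheduled
```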
Thanks, that worked. I'm a complete Scrapy newbie, and honestly I have been struggling to work out how yield fits into Scrapy, since I had never seen it before this. So when you yield response.follow, it will always come back and go through the parse function again? — Yes, by default it will call the parse function again, but you can pass another callback, like this: yield response.follow(next_page, self.other_parse_function)
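The callback dispatch described in that comment can be sketched with plain generators. This is a toy engine loop, not Scrapy's actual internals: `crawl`, `fetch`, and the `pages` dict are all made up for illustration. It shows that a request with no explicit callback falls back to the default parse function, while item requests carry their own callback:

```python
class Request:
    """Toy request: a URL plus an optional callback (defaults to parse)."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def crawl(start_request, fetch, default_callback):
    """Tiny engine loop: fetch each queued request, invoke its callback
    (or the default) on the response, and queue whatever it yields."""
    queue = [start_request]
    visited = []
    while queue:
        req = queue.pop(0)
        response = fetch(req.url)
        callback = req.callback or default_callback
        visited.append((req.url, callback.__name__))
        queue.extend(callback(response))
    return visited

# Fake site: two listing pages, each linking one item page.
pages = {
    "/recipes/":       {"items": ["/recipe/1"], "next": "/recipes/page/2"},
    "/recipes/page/2": {"items": ["/recipe/2"], "next": None},
    "/recipe/1":       {"items": [], "next": None},
    "/recipe/2":       {"items": [], "next": None},
}

def fetch(url):
    return pages[url]

def parse(response):
    for item_url in response["items"]:
        yield Request(item_url, callback=parse_item)  # explicit callback
    if response["next"]:
        yield Request(response["next"])  # no callback: falls back to parse

def parse_item(response):
    return iter(())  # item pages schedule nothing further

visited = crawl(Request("/recipes/"), fetch, parse)
print(visited)
# [('/recipes/', 'parse'), ('/recipe/1', 'parse_item'),
#  ('/recipes/page/2', 'parse'), ('/recipe/2', 'parse_item')]
```

Listing pages keep flowing through parse, while each item page goes through parse_item exactly once, which mirrors how response.follow(next_page) and response.follow(href, callback=self.process_page) divide the work in the spider above.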