Python 3.x: Scrapy does not recursively crawl all links
I need all the internal links from every page of a website for analysis. I have searched through many similar questions and found the code below, which looked like a possible answer. However, it does not return all the possible links from the second level of page depth. The generated file has only 676 records, but the website has 1000 records.

Working code:
import csv  # Done to avoid line gaps in the generated csv file
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from eylinks.items import LinkscrawlItem

outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)


class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]
    rules = ([Rule(LinkExtractor(allow=r".*"), callback='parse', follow=True)])

    def parse(self, response):
        extractor = LinkExtractor(allow_domains='toscrape.com')
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.collect_data)

    def collect_data(self, response):
        global writer
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            lnk = response.url
            stats = response.status
            print(lnk)
            yield {'Name': product, 'Price': value, "URL": lnk, "Status": stats}
            writer.writerow([product, value, lnk, stats])
To get the extracted links, try the following:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import csv

# newline='' avoids blank lines between rows in the generated CSV file
outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)


class BooksScrapySpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Request the detail page of every book on the current listing page
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

        # Follow pagination; the last page has no "next" link
        next_page_url = response.xpath(
            "//a[text()='next']/@href").extract_first()
        if next_page_url:
            absolute_next_page = response.urljoin(next_page_url)
            yield Request(absolute_next_page)

    def parse_book(self, response):
        title = response.css("h1::text").extract_first()
        price = response.xpath(
            "//*[@class='price_color']/text()").extract_first()
        url = response.request.url
        yield {'title': title,
               'price': price,
               'url': url,
               'status': response.status}
        writer.writerow([title, price, url, response.status])
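Your code works great. Thank you for the guidance. However, my end goal is to collect only the URL, title, and status of every page on the site so that I can track down all broken links, so following the next-page link will not work for me in the end :(. I will try to edit this code to fit my needs. As far as I can tell from two full days of research, I have to use `from scrapy.linkextractors import LinkExtractor` to achieve this. I would appreciate any help with this. Also, do you know why LinkExtractor would exclude some links?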