Python Scrapy: recursively scrape a website
I want to write a scraper that visits all sub-pages of an initial page. The example site is pydro.com. For instance, it should also fetch pydro.com/impressum and save it as an HTML file on my hard drive. The code I wrote is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader
from finalproject.items import FinalprojectItem


class ExampleSpider(CrawlSpider):
    name = "projects"  # Spider name
    allowed_domains = ["pydro.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://pydro.com/"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url.split('.com/')[-1]
        filename = 'pydro.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I run the spider, the only output file is pydro.html.
I think I need to adjust my filename so that each sub-page gets its own file. Or do I need a for loop?
Edit 1:
I edited the code so that it saves all HTML pages. However, when I run the script against another website, I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'otego-https://www.otego.de/de/jobs.php'
This is the script I ran:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader


class ExampleSpider(CrawlSpider):
    name = "otego"  # Spider name
    allowed_domains = ["otego.de"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://www.otego.de/en/index.php"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url
        filename = 'otego-%s' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
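The FileNotFoundError comes from the filename itself: the full URL contains `/` characters, which `open()` interprets as directory separators, so it tries to write into directories that do not exist. One possible fix (a sketch, not from the original thread; the helper name and the underscore replacement are my own choice) is to sanitize the URL before using it as a filename:

```python
import re

def url_to_filename(url, prefix='otego'):
    """Turn a full URL into a flat, filesystem-safe filename.

    Every character that open() could treat as a path separator
    (or that is otherwise unsafe) is replaced with an underscore,
    so the file is always written into the current directory.
    """
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', url)
    return '%s-%s.html' % (prefix, safe)

# url_to_filename('https://www.otego.de/de/jobs.php')
# -> 'otego-https_www.otego.de_de_jobs.php.html'
```

Inside `parse_item` this would replace the line `filename = 'otego-%s' % page`.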
You need to build a recursive scraper. A "sub-page" is just another page whose URL is obtained from the "previous" page. You have to issue a second request to the sub-page; the sub-page's URL should be in the variable sel, and you then use XPath on the second response.
Hmm, I thought that by using a CrawlSpider I am already crawling recursively. That's why I added the rules, I guess? I adjusted the code slightly: page = response.url.split('.com/')[-1] and filename = 'pydro-%s.html' % page. I noticed that every URL on that site starts with something ending in .com/, so I split after .com/ and use what follows as the filename. Now I have all the HTML files on my hard drive. But when I try to apply this to another website, it doesn't work...