How do I crawl an entire website with Scrapy?
I can't crawl the whole website; Scrapy only crawls the surface, and I want to crawl deeper. I have been googling for the last 5-6 hours with no help. My code is below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
Rules short-circuit: the first rule that a link satisfies is the one that gets applied, so your second rule (the one with the callback) will never be called. Change the rules to:
rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
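To illustrate the first-match-wins behavior described above, here is a plain-Python stand-in for CrawlSpider's rule matching (the patterns and names are illustrative, not Scrapy's internals):

```python
import re

# Each rule is a (link-pattern, callback-name, follow) triple. Only the
# FIRST rule whose pattern matches a link is applied; later rules are ignored.
def match_rule(rules, link):
    for pattern, callback, follow in rules:
        if re.search(pattern, link):
            return callback, follow
    return None, False

# Two rules like in the question: the first matches everything and has no
# callback, so the second rule (with the callback) is never reached.
broken_rules = [(r".*", None, True), (r".*", "parse_item", False)]
# The fix: a single rule that both follows links and runs the callback.
fixed_rules = [(r".*", "parse_item", True)]

print(match_rule(broken_rules, "http://www.example.com/page"))  # (None, True)
print(match_rule(fixed_rules, "http://www.example.com/page"))   # ('parse_item', True)
```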
When parsing the start_urls, deeper URLs can be extracted from the href attributes of the <a> tags, and deeper requests can then be yielded inside the parse() function. The key source code looks like this:
from scrapy.spiders import Spider
from tutsplus.items import TutsplusItem
from scrapy.http import Request
import re

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["code.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # Links already scheduled from this response (note: this list is
        # recreated on every call to parse(); Scrapy's built-in dupe
        # filter handles deduplication across the whole crawl)
        crawledLinks = []

        # Pattern to check for a proper link;
        # I only want to get tutorial listing pages, e.g. /tutorials?page=2
        linkPattern = re.compile(r"^/tutorials\?page=\d+")

        for link in links:
            # If it is a proper link and not seen yet, yield a new Request
            if linkPattern.match(link):
                link = "http://code.tutsplus.com" + link
                if link not in crawledLinks:
                    crawledLinks.append(link)
                    yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            yield item
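The link filtering and deduplication logic above can be exercised without running Scrapy at all; a minimal plain-Python sketch (the hrefs below are made up for illustration):

```python
import re

# Same pattern as in the spider: match tutorial listing pages only.
link_pattern = re.compile(r"^/tutorials\?page=\d+")

def filter_links(hrefs, base="http://code.tutsplus.com"):
    """Return absolute URLs for matching, not-yet-seen links."""
    seen = set()  # set membership is O(1), unlike the list used in the spider
    out = []
    for href in hrefs:
        if link_pattern.match(href):
            url = base + href
            if url not in seen:
                seen.add(url)
                out.append(url)
    return out

hrefs = ["/tutorials?page=2", "/about", "/tutorials?page=2", "/tutorials?page=3"]
print(filter_links(hrefs))
# ['http://code.tutsplus.com/tutorials?page=2', 'http://code.tutsplus.com/tutorials?page=3']
```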
Comments:
- Just tried your code against stackoverflow -- my IP got banned. It does work! :)
- @Alexander - sounds encouraging, I'll keep debugging :) ... sorry about your IP ban, mate!
- Are you really trying to crawl example.com? You know that's not a real site. Which site do you want to crawl?
- "example.com" is just for illustration. I'm trying to crawl dmoz.org.
- @All - got it working ... Steven was right, thanks for the help! But I can't crawl the whole website, only about 80+ pages. Anything to correct? This is my working version: (Rule(SgmlLinkExtractor(allow=('pages/')), follow=True, callback='parse_item'),)
- Hi! Would you mind helping with this?
- @Steven Almeroth Hi Steven, could you help? I tried the change in the rules but it didn't work for me.
- dmoz.org doesn't have any links with "Items" in the href, so your rule finds no links, which is why your items.json file is empty.