Python Scrapy: parse only pages with meta noindex
I am trying to crawl a website and parse items only from pages that carry a meta noindex tag. What happens is that the crawler crawls the first level but stops after the first page; it does not seem to follow links. Here is my code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=()), callback="parse_items", follow=True),
    )

    def _response_downloaded(self, response):
        sel = HtmlXPathSelector(response)
        if sel.xpath('//meta[@content="noindex"]'):
            return super(mydomainSpider, self).parse_items(response)
        return

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)

        yield items
The original _response_downloaded calls the _parse_response function, which, besides invoking the callback, also follows links. From the Scrapy code:
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
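If you really wanted to keep the override, adding the link-following part yourself might look roughly like this (only a sketch against the CrawlSpider internals shown above, not tested code; _follow_links and _requests_to_follow are the private helpers from that source):

def _response_downloaded(self, response):
    sel = HtmlXPathSelector(response)

    # Hand the page to parse_items only when it carries a noindex meta tag...
    if sel.select('//meta[@content="noindex"]'):
        for item in self.parse_items(response):
            yield item

    # ...but keep following links either way, mirroring _parse_response.
    if self._follow_links:
        for request in self._requests_to_follow(response):
            yield request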
You could add the follow-links part yourself, as sketched above, though I don't think that is the best way to go (the leading underscore probably hints at that). Why not check the meta at the beginning of your parse_items function instead? You could even write a Python decorator if you don't want to repeat the test.
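For illustration, such a decorator might look like this (a minimal sketch; noindex_only is a hypothetical name, not part of Scrapy):

from functools import wraps

from scrapy.selector import HtmlXPathSelector


def noindex_only(callback):
    # Hypothetical helper: run the wrapped spider callback only when the
    # response contains a <meta content="noindex"> tag; otherwise emit nothing.
    @wraps(callback)
    def wrapper(self, response, *args, **kwargs):
        sel = HtmlXPathSelector(response)
        if sel.select('//meta[@content="noindex"]'):
            for result in callback(self, response, *args, **kwargs) or ():
                yield result
    return wrapper

parse_items could then be decorated with @noindex_only instead of repeating the check at the top of every callback.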
I believe that checking the meta at the start of my parse_items, as @Guy Gavriely suggests, will be my best option. I will test the code below to see:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=()), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        if hxs.xpath('//meta[@content="noindex"]'):
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['title'] = site.xpath('/html/head/title/text()').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)

            yield items
Working code update: I needed to return the items instead of yielding them:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=()), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        if hxs.xpath('//meta[@content="noindex"]'):
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['title'] = site.xpath('/html/head/title/text()').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)

            return items
Checking the meta at the beginning of my parse_items does seem like the simplest way. I'll give it a try, thanks again!

My code below doesn't seem to parse any URLs though; am I checking the meta correctly before parsing?

No, your code looks fine; try adding prints/logging for debugging, e.g. print response.url right at the beginning of the parse_items function.

Found the error: ERROR: Spider must return Request, BaseItem or None, got 'list'. You can either yield the items one by one, or accumulate them into a list and return the list, but you cannot yield a list. Yielding the items one by one is nicer in my opinion.
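Following that last comment, the yield-per-item version of parse_items would look something like this (the same logic as above, only with yield moved inside the loop so the spider never yields a bare list):

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)

    if hxs.select('//meta[@content="noindex"]'):
        for site in hxs.select('//html'):
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['title'] = site.select('/html/head/title/text()').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            yield item  # one item at a time, never a list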