
Python Scrapy: only parse pages with meta noindex


I am trying to crawl a website and parse only the pages that carry a meta noindex tag. What happens is that the crawler gets through the first level but then stops at the first page; it does not seem to follow the links. Here is my code:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
    Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
    )

    def _response_downloaded(self, response):
        sel = HtmlXPathSelector(response)
        if sel.xpath('//meta[@content="noindex"]'):
            return super(mydomainSpider, self).parse_items(response)
        return

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)

        yield items

The original _response_downloaded calls the _parse_response function, which, besides calling the callback, also follows links. From the Scrapy code:

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
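
For comparison, the un-overridden _response_downloaded in CrawlSpider looks roughly like this (paraphrased from the Scrapy source of that generation; details may vary between versions). Because the override in the question returns only the callback result, the follow branch of _parse_response never runs, which is why no links get followed:

def _response_downloaded(self, response):
    # Look up the Rule that generated this request and delegate to
    # _parse_response, which both runs the callback and follows links.
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)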

You could add the follow-links part yourself, though I don't think that is the best way to go (the leading _ probably hints at that). Why not check for the meta tag at the beginning of your parse_items function instead? You could even write a Python decorator if you don't want to repeat that test.
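
As an illustration only, such a decorator might look like the sketch below (require_noindex is a made-up name, not part of Scrapy; it assumes the same HtmlXPathSelector API used above):

from functools import wraps

from scrapy.selector import HtmlXPathSelector


def require_noindex(callback):
    # Hypothetical decorator: run the wrapped callback only when the
    # response contains a meta tag with content="noindex".
    @wraps(callback)
    def wrapper(self, response, *args, **kwargs):
        sel = HtmlXPathSelector(response)
        if sel.select('//meta[@content="noindex"]'):
            return callback(self, response, *args, **kwargs)
        # No noindex meta: nothing to extract from this page.
        return []
    return wrapper

It would then be applied with @require_noindex directly above def parse_items(self, response), so the extraction code itself stays free of the check.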

I believe that checking the meta at the start of my parse_items, as @Guy Gavriely suggested, will be my best option. I will test the code below and see:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
    Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        if hxs.xpath('//meta[@content="noindex"]'):
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['title'] = site.xpath('/html/head/title/text()').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)

            yield items
Working code update, I needed to return the items instead of yielding them:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
    Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        if hxs.xpath('//meta[@content="noindex"]'):
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['title'] = site.xpath('/html/head/title/text()').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)

            return items

Checking the meta at the start of my parse_items does seem like the simplest way. I will give it a try, thanks again!

My code below doesn't seem to parse any URLs, though; am I checking for the meta correctly before parsing?

No, your code looks fine. Try adding a print/log statement for debugging, for example print response.url right at the beginning of parse_items.

Found the error: ERROR: Spider must return Request, BaseItem or None, got 'list'. You can either yield items one by one or accumulate them into a list and return it, but you cannot yield a list; in my opinion yielding the items is nicer.
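
To illustrate that last comment, here is a minimal sketch of the same parse_items (inside the spider class above) rewritten to yield one item at a time instead of returning a list:

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        # Only extract from pages that carry a noindex meta tag.
        if hxs.select('//meta[@content="noindex"]'):
            for site in hxs.select('//html'):
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['title'] = site.select('/html/head/title/text()').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                # Yield each item on its own instead of collecting a list.
                yield item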