Python Scrapy spider not entering the parse_item method with an SgmlLinkExtractor Rule

Tags: python, web-crawler, scrapy, scrapy-spider

I am building a crawler to scrape a website recursively, but the problem is that the spider never enters the parse_item method. My spider file is named example.py. Here is the code:

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy.utils.response import get_base_url


class CrawlSpider(CrawlSpider):
    name = "example"
    download_delay = 2
    allowed_domains = ["dmoz.org"]
    print allowed_domains
    start_urls = [
        "http://www.dmoz.org/Arts/"
    ]
    print start_urls
    rules = (
        Rule(SgmlLinkExtractor(allow=('/Arts', )), callback='parse_item', follow=True),
    )

    # The spider never enters this parse_item method

    def parse_item(self, response):
        print "hello parse"
        sel = Selector(response)
        title = sel.xpath('//title/text()').extract()
        print title

Why define and call the function explicitly? Try this:

class CrawlSpider(CrawlSpider):
    name = "example"
    download_delay = 2
    allowed_domains = ["dmoz.org"]
    print allowed_domains
    start_urls = ["http://www.dmoz.org/Arts/"]

    def parse(self, response):
        print "hello parse"
        sel = Selector(response)
        title = sel.xpath('//title/text()').extract()
        print title
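Assuming a standard Scrapy project layout, the spider is then run from the project root with the usual command:

scrapy crawl example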

Comments:

Is the URL http://www.tutorial.com/tutorials/steps correct? I opened it in a web browser and was redirected to http://www.tutorial.com/?f. Also, there is no link to /tutorials among the hrefs on the page.

No! That is an arbitrary URL I made up for the example. But even if I replace it with the real URL, it still doesn't work. I also have another spider in the same project, with the same code, and parse_item works fine there.

@rohan I think you are using the wrong regex in allow. Can you give a few examples of the links you want to scrape?

@vipul I have updated the code, please check it.

If you want all the links under Arts, then you should use the regex allow='/Arts/*'. This will match everything after www.dmoz.org/Arts/, e.g. www.dmoz.org/Arts/Movies and so on. A sketch of such a Rule follows below.
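A minimal sketch of that suggested Rule, assuming the dmoz.org spider from the question; the class and spider names (ArtsSpider, arts_example) are hypothetical. Note that allow takes regular expressions, so '/Arts/*' means '/Arts' followed by zero or more slashes and therefore matches any URL containing /Arts:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ArtsSpider(CrawlSpider):
    # Hypothetical name for this sketch
    name = "arts_example"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Arts/"]

    # allow takes regexes: '/Arts/*' is '/Arts' plus zero or more
    # slashes, so it matches any URL containing /Arts
    rules = (
        Rule(SgmlLinkExtractor(allow=('/Arts/*', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Called for every link the rule extracts and follows
        print "parsed", response.url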