Python: crawling one level down with Scrapy
I am very new to Python. I am trying to print (and save) all the blog posts on a website using Scrapy. I want the spider to crawl only within the main content section. Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from people.items import PeopleCommentItem


class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]

    rules = [
        Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
        # restrict the crawling to the articalContent section only
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="articalContent "]//a/@href'))),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="articalContent "]//a/text()').extract()
Nothing is printed; the output is only:
DEBUG: Crawled (200) <GET http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html> (referer: None)
ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
hxs=HtmlXPathSelector(response)
ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles= hxs.select('//div[@class="articalContent "]//a/text()').extract()
2015-03-09 15:46:47-0700 [people] INFO: Closing spider (finished)
Can someone tell me what is wrong? Thanks.

I had some success with this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request


class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
        # restrict the crawling to the articalContent section only
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "articalContent")]'))),
    )

    def parse(self, response):
        links = Selector(text=response.body).xpath('//div[contains(@class, "articalContent")]//a//text()')
        for link in links:
            print link.extract()
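The likely reason the original spider printed nothing is the exact-match predicate `@class="articalContent "`: XPath compares the whole attribute string, so it only matches when the `class` attribute is exactly `articalContent ` with that trailing space. `contains(@class, "articalContent")` does a substring match, which works here but can over-match (for example, a hypothetical class name like `articalContentFooter`). The standard XPath idiom for matching a single class token is `contains(concat(" ", normalize-space(@class), " "), " articalContent ")`. A minimal pure-Python sketch of the same logic (the class values below are made up for illustration):

```python
def has_class_token(class_attr, token):
    # Python equivalent of the XPath idiom:
    # contains(concat(" ", normalize-space(@class), " "), " token ")
    normalized = " ".join(class_attr.split())  # collapse whitespace, like normalize-space()
    return " %s " % token in " %s " % normalized

# Exact comparison fails as soon as the attribute carries extra classes or spacing:
print('articalContent newfont_family' == 'articalContent ')                # False
# Plain substring matching over-matches on a hypothetical longer class name:
print('articalContent' in 'articalContentFooter')                          # True
# Token matching gets both cases right:
print(has_class_token('articalContent newfont_family', 'articalContent'))  # True
print(has_class_token('articalContentFooter', 'articalContent'))           # False
```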