Python: crawling one level down with Scrapy
I am very new to Python. I am trying to print (and save) all the blog posts on a website using Scrapy. I want the spider to crawl only within the main content section. Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from people.items import PeopleCommentItem


class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]

    rules = [
        Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
        # restrict the crawling to the articalContent section only
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="articalContent "]//a/@href'))),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('//div[@class="articalContent "]//a/text()').extract()
Nothing is printed; the output is only:
DEBUG: Crawled (200) <GET http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html> (referer: None)
ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
hxs=HtmlXPathSelector(response)
ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles= hxs.select('//div[@class="articalContent "]//a/text()').extract()
2015-03-09 15:46:47-0700 [people] INFO: Closing spider (finished)
Can someone tell me what is wrong? Thanks.

I had some success with this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request


class people(CrawlSpider):
    name = "people"
    allowed_domains = ["http://blog.sina.com.cn/"]
    start_urls = ["http://blog.sina.com.cn/s/blog_53d7b5ce0100e7y0.html"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("http://blog.sina.com.cn/",)), callback='parse_item', follow=True),
        # restrict the crawling to the articalContent section only
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "articalContent")]'))),
    )

    def parse(self, response):
        links = Selector(text=response.body).xpath('//div[contains(@class, "articalContent")]//a//text()')
        for link in links:
            print link.extract()
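The likely reason the original spider printed nothing is the exact-match predicate `@class="articalContent "`: XPath compares the whole attribute string, so it only matches when the `class` attribute is exactly `articalContent ` with that trailing space. `contains(@class, "articalContent")` does a substring match, which works here but can over-match (for example, a hypothetical class name like `articalContentFooter`). The standard XPath idiom for matching a single class token is `contains(concat(" ", normalize-space(@class), " "), " articalContent ")`. A minimal pure-Python sketch of the same logic (the class values below are made up for illustration):

```python
def has_class_token(class_attr, token):
    # Python equivalent of the XPath idiom:
    # contains(concat(" ", normalize-space(@class), " "), " token ")
    normalized = " ".join(class_attr.split())  # collapse whitespace, like normalize-space()
    return " %s " % token in " %s " % normalized

# Exact comparison fails as soon as the attribute carries extra classes or spacing:
print('articalContent newfont_family' == 'articalContent ')                # False
# Plain substring matching over-matches on a hypothetical longer class name:
print('articalContent' in 'articalContentFooter')                          # True
# Token matching gets both cases right:
print(has_class_token('articalContent newfont_family', 'articalContent'))  # True
print(has_class_token('articalContentFooter', 'articalContent'))           # False
```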