NameError: name 'hxs' is not defined when using Scrapy


web-scraping scrapy


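I have launched the Scrapy shell and successfully fetched the Wikipedia main page:

scrapy shell http://en.wikipedia.org/wiki/Main_Page

Judging from Scrapy's verbose output, I am confident this step worked.

Next, I wanted to see what happens when I write

hxs.select('/html').extract()

At this point, I get an error:

NameError: name 'hxs' is not defined

What is the problem? I know Scrapy is installed correctly and has accepted the target URL, so why is there an issue with the hxs command?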

I suspect you are using a version of Scrapy that no longer provides hxs in the shell.

Use sel instead (deprecated after 0.24, see below):

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'

Or, as of Scrapy 1.0, you should use the response's Selector object through its .xpath() and .css() convenience methods:

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'

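FYI, a quote from the Scrapy documentation:

... after the shell loads, you'll have the response available as the response shell variable, and its attached selector in the response.selector attribute. ... Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')
>>> response.css('title::text')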

You should make use of the verbosity of Scrapy's output.

If your verbose output looks like this:

2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines: 
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s]   item       {}
[s]   request    <GET http://en.wikipedia.org/wiki/Main_Page>
[s]   response   <200 http://en.wikipedia.org/wiki/Main_Page>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38) 
Type "copyright", "credits" or "license" for more information.

Your verbose output lists the available Scrapy objects.
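If you would rather check this programmatically than read the banner by eye, here is a minimal sketch that pulls the object names out of the "[s]" lines. The helper name available_shell_objects is hypothetical (it is not part of Scrapy), and it assumes the banner layout matches the log shown above:

```python
import re

# Matches banner lines such as "[s]   hxs        <HtmlXPathSelector ...>".
OBJECT_LINE = re.compile(r"^\[s\]\s+(\w+)\s+\S")


def available_shell_objects(banner):
    """Return the object names listed in a Scrapy shell banner (hypothetical helper)."""
    names = []
    in_objects = False
    for line in banner.splitlines():
        if line.startswith("[s] Available Scrapy objects"):
            in_objects = True
        elif line.startswith("[s] Useful shortcuts"):
            # Everything after this header is a shortcut, not an object.
            in_objects = False
        elif in_objects:
            match = OBJECT_LINE.match(line)
            if match:
                names.append(match.group(1))
    return names


# Excerpt of the banner shown above (Scrapy 0.14.4).
banner = """\
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s]   item       {}
[s]   request    <GET http://en.wikipedia.org/wiki/Main_Page>
[s]   response   <200 http://en.wikipedia.org/wiki/Main_Page>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
"""

names = available_shell_objects(banner)
print(names)
print("hxs" in names, "sel" in names)
```

If hxs is missing from the resulting list, typing it in the shell raises exactly the NameError from the question.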

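So whether you use hxs or sel depends on what your verbose output shows. In your case hxs is not available, so you need to use sel, which newer versions provide in place of hxs. In other words, hxs works for some people, while others need to use sel.

The sel shortcut is deprecated; you should use response.xpath('/html').extract()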
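To make the "it depends on your version" point concrete, here is a rough sketch of a fallback that picks whichever shortcut the current shell actually defines. pick_selector is a hypothetical helper, not a Scrapy function, and the preference order (response first, then sel, then hxs) is an assumption based on the answers in this thread. The stand-in class below only exists so the sketch runs outside a Scrapy shell:

```python
def pick_selector(namespace):
    """Return (name, object) for the newest selector shortcut in namespace.

    Hypothetical helper. Assumed preference order: response.xpath,
    then sel, then hxs.
    """
    response = namespace.get("response")
    if response is not None and hasattr(response, "xpath"):
        return "response", response
    for name in ("sel", "hxs"):
        if name in namespace:
            return name, namespace[name]
    raise NameError("no selector shortcut found in this shell")


class FakeNewResponse:
    """Stand-in for a response that has the .xpath() shortcut."""

    def xpath(self, query):
        return []


# Stand-ins for the names an old and a new shell might define.
old_shell = {"response": object(), "hxs": "hxs stand-in"}
new_shell = {"response": FakeNewResponse(), "sel": "sel stand-in"}

print(pick_selector(old_shell)[0])
print(pick_selector(new_shell)[0])
```

Inside a real shell you would call pick_selector(globals()). Keep in mind that the old hxs object is queried with .select(), as in the question, while sel and response use .xpath().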

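Works great. Thank you very much for bringing this to my attention.

Strangely, it had +2 a few minutes ago. It looks like it got 2 downvotes back then, and it only survived about 10 minutes... and it only shows 4 views.

@Matt O'Brien: that's because the people who upvoted decided to downvote. That's fine, it happens. Ideally I would get feedback so that I could improve the answer, but it is what it is. Thanks.

I upvoted the answer, and my vote was later removed... so there are two unhappy people...

0.14.4 is more than 2 years old. Why not go down to 0.7 then?

@alecxe you're right: I should always use the latest version, but Scrapy 0.7 is the version I have right now.

Well, even though it's deprecated, it works in the shell without any problems.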
