Python: can't call css() or xpath() on scrapy.Response
I'm trying to write a web crawler with scrapy. However, when I tried to test it with its interactive shell, I got the error below.

Error message:
2016-03-01 22:15:08 [scrapy] INFO: Scrapy 1.0.5 started (bot: momo)
2016-03-01 22:15:08 [scrapy] INFO: Optional features available: ssl, http11
2016-03-01 22:15:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'momo.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['momo.spiders'], 'FEED_URI': 'j.json', 'BOT_NAME': 'momo'}
2016-03-01 22:15:08 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-01 22:15:08 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-01 22:15:08 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-01 22:15:08 [scrapy] INFO: Enabled item pipelines:
2016-03-01 22:15:08 [scrapy] INFO: Spider opened
2016-03-01 22:15:08 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-01 22:15:08 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-01 22:15:09 [scrapy] DEBUG: Crawled (200) <GET http://www.momoshop.com.tw/main/Main.jsp> (referer: None)
2016-03-01 22:15:11 [scrapy] DEBUG: Crawled (200) <GET http://www.momoshop.com.tw/goods/GoodsDetail.jsp?i_code=1697199&str_category_code=2200700058&cid=ec&oid=1c&mdiv=1000000000-bt_0_209_01-bt_0_209_01_e11&ctype=B> (referer: http://www.momoshop.com.tw/main/Main.jsp)
2016-03-01 22:15:11 [scrapy] DEBUG: Crawled (200) <GET http://www.momoshop.com.tw/goods/GoodsDetail.jsp?i_code=3753480&str_category_code=1514200303&cid=ec&oid=2a&mdiv=1000000000-bt_0_209_01-bt_0_209_01_e25&ctype=B> (referer: http://www.momoshop.com.tw/main/Main.jsp)
2016-03-01 22:15:11 [scrapy] DEBUG: Crawled (200) <GET http://www.momoshop.com.tw/goods/GoodsDetail.jsp?i_code=3754704&str_category_code=1417802005&cid=ec&oid=1f&mdiv=1000000000-bt_0_209_01-bt_0_209_01_e20&ctype=B> (referer: http://www.momoshop.com.tw/main/Main.jsp)
2016-03-01 22:15:11 [scrapy] DEBUG: Crawled (200) <GET http://www.momoshop.com.tw/goods/GoodsDetail.jsp?i_code=3811447&str_category_code=1318900078&cid=ec&oid=1d&mdiv=1000000000-bt_0_209_01-bt_0_209_01_e14&ctype=B> (referer: http://www.momoshop.com.tw/main/Main.jsp)
{'Date': ['Tue, 01 Mar 2016 14:15:10 GMT'], 'Set-Cookie': ['loginRsult=null;Expires=Thu, 01-Jan-01970 00:00:10 GMT;Path=/', 'loginUser=null;Expires=Thu, 01-Jan-01970 00:00:10 GMT;Path=/', 'cardUser=null;Expires=Thu, 01-Jan-01970 00:00:10 GMT;Path=/', '18YEARAGREE=null;Expires=Thu, 01-Jan-01970 00:00:10 GMT;Path=/', 'Browsehist=1697199,3753480,3754704,2189725;Path=/', 'FTOOTH=22;Path=/', 'DCODE=2200700058;Path=/'], 'Content-Type': ['']}
2016-03-01 22:15:11 [scrapy] ERROR: Spider error processing <GET http://www.momoshop.com.tw/goods/GoodsDetail.jsp?i_code=1697199&str_category_code=2200700058&cid=ec&oid=1c&mdiv=1000000000-bt_0_209_01-bt_0_209_01_e11&ctype=B> (referer: http://www.momoshop.com.tw/main/Main.jsp)
Traceback (most recent call last):
  File "/Users/Shane/Desktop/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Users/Shane/Desktop/scrapy/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/Users/Shane/Desktop/scrapy/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/Shane/Desktop/scrapy/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/Shane/Desktop/scrapy/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/Shane/Desktop/scrapy/momo/momo/spiders/default_spider.py", line 35, in parseGoods
    item.item = response.css('h1').extract()
AttributeError: 'Response' object has no attribute 'css'
Response to Jed's answer: are you using

from scrapy.selector import Selector

in your bot? The bot's code would be useful. Also, those should be the only attributes the object has; css() is just a convenience function.
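For context: css() and xpath() are shortcuts that only TextResponse and its subclasses (such as HtmlResponse) provide; a bare Response, which Scrapy falls back to when it cannot infer a text type (note the empty Content-Type in the headers dump above), lacks them. A minimal sketch of querying such a response through a Selector directly, assuming the page is UTF-8 encoded:

from scrapy.selector import Selector

def parseGoods(self, response):
    # A bare Response has no .css()/.xpath(); build a Selector from
    # the raw body instead. The UTF-8 encoding is an assumption.
    sel = Selector(text=response.body.decode('utf-8'))
    titles = sel.css('h1').extract()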
EDIT:
The problem lies in the callback. The callback function needs its parameter, self.parseGoods(response); .css() is then used on that response inside parseGoods. It works on my laptop.
EDIT:
def parse(self, response):
    for href in response.xpath('//a[contains(@href, "/goods")]/@href'):
        url = response.urljoin(href.extract())
        self.parseGoods(response)
        yield Request(url, callback=self.parseGoods(response))

    # for href in response.xpath('//a[contains(@href, "/category")]'):
    #     url = response.urljoin(href.extract())
    #     yield scrapy.Request(url, callback=self.parse)
    #
    # for href in response.xpath('//a[contains(@href, "/brand")]'):
    #     url = response.urljoin(href.extract())
    #     yield scrapy.Request(url, callback=self.parse)

def parseGoods(self, response):
    item = MomoItem()
    print(response.headers)
    item['item'] = response.css('h1').extract()
    item['price'] = response.xpath('//ul[@class="prdPrice"]/li/span/text()').extract()
    print(item)
    return item
Well, this should work. I changed yield item to return item and fixed the item access, e.g. item.item to item['item'] (see the sketch below). Try it and tell me if something is wrong.
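The access change matters because scrapy.Item fields only support dict-style access; assigning to the attribute raises. A minimal sketch using the MomoItem from this thread:

import scrapy

class MomoItem(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()

m = MomoItem()
m['item'] = 'product title'   # dict-style field access works
# m.item = 'product title'    # AttributeError: use item['item'] = ... to set field value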
The problem turned out to be the default scrapy HTML parser. Once I switched to another parser, it worked like a charm: lxml doesn't seem to parse broken HTML as gracefully as BeautifulSoup4 does.
from bs4 import BeautifulSoup
import scrapy

class MomoItem(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    # specification = scrapy.Field()

class MomoSpider(scrapy.Spider):
    name = "momo"
    allowed_domains = ["www.momoshop.com.tw"]
    start_urls = ["http://www.momoshop.com.tw/main/Main.jsp"]

    def parse(self, response):
        for href in response.xpath('//a[contains(@href, "/goods")]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parseGoods)

        # for href in response.xpath('//a[contains(@href, "/category")]'):
        #     url = response.urljoin(href.extract())
        #     yield scrapy.Request(url, callback=self.parse)
        #
        # for href in response.xpath('//a[contains(@href, "/brand")]'):
        #     url = response.urljoin(href.extract())
        #     yield scrapy.Request(url, callback=self.parse)

    def parseGoods(self, response):
        # response.body is the public accessor for the raw bytes
        # (the original used the private response._body).
        soup = BeautifulSoup(response.body, 'html.parser')
        item = MomoItem()
        item['item'] = soup.find_all('h1')[0].get_text()
        # Parentheses let the chained lookup span two lines.
        item['price'] = (soup.find_all('ul', class_='prdPrice')[0]
                         .find_all('li', class_='special')[0].span.get_text())
        yield item
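For reference, the overridden settings in the log at the top (BOT_NAME momo, FEED_FORMAT json, FEED_URI j.json) suggest the spider was run with a JSON feed export, e.g. something like scrapy crawl momo -o j.json from the project directory; the exact command isn't shown in the question.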
You're right, those are the only attributes that should be there. But then why can't I call css() or xpath()?

Does adding from scrapy.selector import Selector after import scrapy in the bot work?

Try changing yield Request(url, callback=self.parseGoods) to yield Request(url, callback=self.parseGoods(self, response)).

Can you share your code? I can't get it to work with yield Request(url, callback=self.parseGoods(self, response)). Going by it, mine seems correct.

Yeah, my bad v.v, it shouldn't have self in it. But another error came up; please check the edited answer.
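To untangle the callback back-and-forth above: Scrapy invokes the callback itself, passing the downloaded response, so Request should be given the bound method rather than the result of calling it. A minimal sketch of the two forms discussed (only the first is valid):

import scrapy

class MomoSpider(scrapy.Spider):
    name = "momo"
    start_urls = ["http://www.momoshop.com.tw/main/Main.jsp"]

    def parse(self, response):
        for href in response.xpath('//a[contains(@href, "/goods")]/@href'):
            url = response.urljoin(href.extract())
            # Valid: pass the method; Scrapy later calls
            # self.parseGoods(response) with the downloaded page.
            yield scrapy.Request(url, callback=self.parseGoods)
            # Invalid: calls parseGoods immediately with the current
            # (listing-page) response and passes its return value,
            # which is not callable, as the callback.
            # yield scrapy.Request(url, callback=self.parseGoods(response))

    def parseGoods(self, response):
        pass  # parsing as in the answers above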