Scrapy spider for a JSON response
I am trying to write a spider that crawls the following JSON response. What would the spider look like if I want to scrape all the titles of the videos? None of my spiders work.
    from scrapy.spider import BaseSpider
    import json
    from youtube.items import YoutubeItem

    class MySpider(BaseSpider):
        name = "youtubecrawler"
        allowed_domains = ["gdata.youtube.com"]
        start_urls = ['http://www.gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

        def parse(self, response):
            items = []
            jsonresponse = json.loads(response)
            for video in jsonresponse["feed"]["entry"]:
                item = YoutubeItem()
                print jsonresponse
                print video["media$group"]["yt$videoid"]["$t"]
                print video["media$group"]["media$description"]["$t"]
                item["title"] = video["title"]["$t"]
                print video["author"][0]["name"]["$t"]
                print video["category"][1]["term"]
                items.append(item)
            return items
I always get the following error:
2014-01-05 16:55:21+0100 [youtubecrawler] ERROR: Spider error processing <GET http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
        self._startRunCallbacks(result)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/bxxxx/svn/ba_txxxxx/scrapy/youtube/spiders/test.py", line 15, in parse
        jsonresponse = json.loads(response)
      File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    exceptions.TypeError: expected string or buffer
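The TypeError at the bottom of the traceback is the whole story: json.loads expects a string (or bytes), but Scrapy hands the parse callback a Response object. A minimal reproduction outside Scrapy (Python 3 shown; the Python 2 message is the "expected string or buffer" seen above):

```python
import json

# json.loads expects str/bytes; passing any other object, such as a
# Scrapy Response, fails before any JSON parsing happens.
try:
    json.loads(object())
except TypeError as exc:
    print(exc)  # the JSON decoder rejects non-string input
```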
I found two issues in your code:

1. The start URL is not accessible; take the www out of the host.
2. Change json.loads(response) to json.loads(response.body_as_unicode()).
    class MySpider(BaseSpider):
        name = "youtubecrawler"
        allowed_domains = ["gdata.youtube.com"]
        start_urls = ['http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

        def parse(self, response):
            items = []
            jsonresponse = json.loads(response.body_as_unicode())
            for video in jsonresponse["feed"]["entry"]:
                item = YoutubeItem()
                print video["media$group"]["yt$videoid"]["$t"]
                print video["media$group"]["media$description"]["$t"]
                item["title"] = video["title"]["$t"]
                print video["author"][0]["name"]["$t"]
                print video["category"][1]["term"]
                items.append(item)
            return items
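To see why the corrected line works, here is the nesting of the gdata feed reduced to a stand-in payload (field names taken from the question; the titles are invented):

```python
import json

# Stand-in for response.body_as_unicode(): a decoded JSON string with the
# same nesting as the gdata feed in the question (values are made up).
body = json.dumps({
    "feed": {"entry": [
        {"title": {"$t": "First video"}},
        {"title": {"$t": "Second video"}},
    ]}
})

jsonresponse = json.loads(body)
titles = [video["title"]["$t"] for video in jsonresponse["feed"]["entry"]]
print(titles)  # -> ['First video', 'Second video']
```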
body_as_unicode has since been deprecated; future readers, please use response.text instead. Thanks @timfeirg.
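In modern Scrapy, response.text holds the decoded body (and response.json() parses it directly, from Scrapy 2.2 on). A sketch with a hypothetical minimal Response stand-in, so the decode step that body_as_unicode() used to perform is visible:

```python
import json

# Hypothetical stand-in for a Scrapy Response: the .text property decodes
# the raw .body bytes, which is what body_as_unicode() used to do.
class FakeResponse:
    def __init__(self, body, encoding="utf-8"):
        self.body = body          # raw bytes, as Scrapy stores them
        self.encoding = encoding

    @property
    def text(self):
        return self.body.decode(self.encoding)

resp = FakeResponse(b'{"feed": {"entry": [{"title": {"$t": "Hello"}}]}}')
jsonresponse = json.loads(resp.text)  # replaces json.loads(response.body_as_unicode())
print(jsonresponse["feed"]["entry"][0]["title"]["$t"])  # -> Hello
```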