Extracting data from a website with Python and Scrapy

I want to extract data from a web page. Here is my code so far:

import cStringIO

import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Download the page body into an in-memory buffer with pycurl (Python 2)
buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

# Wrap the downloaded body so that Scrapy selectors can query it
response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need the title, the video link, and the description as separate variables. How can I achieve this?

You can extract the title with //title/text() and the video source link with //video/source/@src:

import re

selector = Selector(response=response)

title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

# The YouTube video code is embedded in the EdImage thumbnail URL
code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

This prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
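
The regular expression step can be tried on its own, without fetching the page. The following is a minimal Python 3 sketch, assuming the EdImage meta content is a thumbnail URL that ends in "<code>-play-small.jpg". The helper name extract_video_code and the sample URL are hypothetical, since the answer does not show the actual content value.

```python
import re

# Hypothetical thumbnail URL; the real EdImage value is not shown in the answer.
SAMPLE_IMAGE_URL = "http://static.example.org/video_image/L49VXZwfup8-play-small.jpg"


def extract_video_code(image_url):
    """Return the word characters preceding '-play-small.jpg', or None."""
    # Same pattern as in the answer, with the dot escaped so it matches literally.
    match = re.search(r'(\w+)-play-small\.jpg$', image_url)
    return match.group(1) if match else None


print(extract_video_code(SAMPLE_IMAGE_URL))
```

Returning None instead of calling .group(1) directly avoids an AttributeError on pages where the meta tag has a different shape.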

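There is no need to use scrapy to fetch a single URL. Just use a simpler tool (even the simplest, urllib.urlopen(theurl).read()!) to get the HTML of the single page, then analyze the HTML with a tool such as BeautifulSoup. From a simple "view source", it looks like what you are looking for is: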

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>

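and so on (but you will have to pick one video link. I see that in their source they are called "pre-roll", so the links to the actual non-advertising video may not be on the page at all, and may only be accessible after logging in or something similar).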
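
On Python 3, urllib.urlopen no longer exists; the closest standard-library call is urllib.request.urlopen. A minimal sketch of fetching a single page that way is shown below. The function name fetch_html is hypothetical, and the user agent and 15-second timeout are simply carried over from the question's pycurl settings.

```python
import urllib.request

# User agent string taken from the question's pycurl options.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0"


def fetch_html(url, timeout=15):
    """Download url and return the response body decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

It would be called as html = fetch_html('http://www.guardalo.org/99407/'), and the returned string can then be handed to whatever HTML parser you prefer.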

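I need to capture this code for the video: L49VXZwfup8. It is the YouTube code of the video.
@pythoncoder OK, I have updated the answer. Is this what you were asking for?
Thanks.
@pythoncoder Also note that Alex Martelli has a valid point here: if you are using Scrapy to extract data from this one URL only, it is a huge overhead. I assume you are going to extend the solution to multiple URLs of this kind.
I need to capture the video code L49VXZwfup8, because it is the YouTube code of the video.
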
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>
<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi è un video sui neonati buffi con risate" />
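# Python 2; assumes BeautifulSoup is installed as the bs4 package
import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text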
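
The same idea extends to the description and the video sources. The sketch below is in Python 3 and swaps BeautifulSoup for the standard-library html.parser module so that it runs without extra packages; it is an illustration, not the approach from the answer. The class name PageInfoParser, the helper parse_page, and the html/head/body/video wrapper around the sample markup are assumptions; the tags themselves are the ones quoted above.

```python
from html.parser import HTMLParser

# Sample markup built from the tags quoted above; the surrounding
# html/head/body/video structure is assumed for illustration.
SAMPLE_HTML = """
<html><head>
<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<meta name="description" content="Ciao a tutti amici di guardalo" />
</head><body>
<video>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
</video>
</body></html>
"""


class PageInfoParser(HTMLParser):
    """Collects the title text, the meta description, and all source URLs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.video_sources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "source" and attrs.get("src"):
            self.video_sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def parse_page(html):
    """Feed html to a fresh parser and return it with its fields filled in."""
    parser = PageInfoParser()
    parser.feed(html)
    return parser


info = parse_page(SAMPLE_HTML)
print(info.title)
print(info.description)
print(info.video_sources)
```

Each value ends up in its own attribute, which is what the question asked for. Note that the source URLs on this page point at the pre-roll clip, so the YouTube code still has to come from elsewhere in the page.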