Python Scrapy Spider: handling non-HTML links (PDF, PPT, etc.)
I am learning Scrapy and Python, starting from a blank project. I am using Scrapy's LxmlLinkExtractor to parse links, but the crawler always gets stuck when it hits a non-HTML link/page (such as a PDF or other document).

Question: how does one usually handle these links with Scrapy, if all I want to do is store their URLs (I don't want the documents' content for now)?

Example page containing documents:

Here is my spider code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from super.items import SuperItem
from scrapy.selector import Selector

class mySuper(CrawlSpider):
    name = "super"
    # only allow crawling of the site listed in allowed_domains
    allowed_domains = ['afcorfmc.org']
    # start from the site's home page
    start_urls = ['http://afcorfmc.org']

    rules = (Rule(LxmlLinkExtractor(allow=(), deny=(), restrict_xpaths=()), callback="parse_o", follow=True),)

    def parse_o(self, response):
        # collect the harvested data (the page content)
        sel = Selector(response)
        # prepare the item we are going to fill (remember, defined in items.py)
        item = SuperItem()
        # store the page URL in the item
        item['url'] = response.url
        # grab the page title via an XPath expression
        #item['titre'] = sel.xpath('//title/text()').extract()
        # hand the item off to the rest of the pipeline
        yield item
As described in the documentation, LxmlLinkExtractor excludes links with certain extensions by default, and that extension list includes .pdf and .ppt.

You can pass the deny_extensions parameter to your LxmlLinkExtractor instance and leave it empty, for example:
$ scrapy shell http://afcorfmc.org/2009.html
2014-10-27 10:27:02+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
...
2014-10-27 10:27:03+0100 [default] DEBUG: Crawled (200) <GET http://afcorfmc.org/2009.html> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f5b1a6f4910>
[s] item {}
[s] request <GET http://afcorfmc.org/2009.html>
[s] response <200 http://afcorfmc.org/2009.html>
[s] settings <scrapy.settings.Settings object at 0x7f5b2013f450>
[s] spider <Spider 'default' at 0x7f5b19e9bed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
In [2]: lx = LxmlLinkExtractor(allow=(),deny=(),restrict_xpaths=(), deny_extensions=())
In [3]: lx.extract_links(response)
Out[3]:
[Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/ANATOMO_PATHOLOGIE_Dr_Guinebretiere.ppt', text='ANATOMO_PATHOLOGIE_Dr_Guinebretiere.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIMIOTHERAPIE_Dr_Toledano.ppt', text='CHIMIOTHERAPIE_Dr_Toledano.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIRURGIE_Dr_Guglielmina.ppt', text='CHIRURGIE_Dr_Guglielmina.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIRURGIE_Dr_Sebban.ppt', text='CHIRURGIE_Dr_Sebban.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/Cas_clinique_oesophage.ppt', text='Cas_clinique_oesophage.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/IMAGERIE_Dr_Seror.ppt', text='IMAGERIE_Dr_Seror.ppt', fragment='', nofollow=False),
...
Link(url='http://afcorfmc.org/documents/TOPOS/2009/OCTOBRE/VB4_Technique%20monoisocentrique%20dans%20le%20sein%20Vero%20Avignon%202009.pdf', text='VB4_Technique monoisocentrique dans le sein Vero Avignon 2009.pdf', fragment='', nofollow=False)]
In [4]:
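With deny_extensions=() the extractor returns document links alongside regular page links, so to store just the URLs you can filter the extracted links by extension yourself. A minimal stdlib sketch of such a check (the extension set here is a hand-picked subset for illustration; Scrapy's own default list, scrapy.linkextractors.IGNORED_EXTENSIONS, is much longer):

```python
from urllib.parse import urlparse
from os.path import splitext

# Hand-picked subset of document extensions for this example;
# Scrapy's built-in ignore list covers many more.
DOC_EXTENSIONS = {'.pdf', '.ppt', '.pptx', '.doc', '.docx', '.xls', '.xlsx'}

def is_document_url(url):
    """Return True when the URL path ends with a known document extension."""
    path = urlparse(url).path
    return splitext(path)[1].lower() in DOC_EXTENSIONS

urls = [
    'http://afcorfmc.org/documents/TOPOS/2009/MARS/IMAGERIE_Dr_Seror.ppt',
    'http://afcorfmc.org/2009.html',
]
# keep only the document URLs, without downloading the documents
doc_urls = [u for u in urls if is_document_url(u)]
```

One way to wire this into the spider would be through the Rule's process_links argument, so document URLs can be recorded without ever being scheduled for download.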
I got "No module named 'scrapy.contrib'". It seems that in current Scrapy versions the import is now: from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor