Python: calling one spider from another in a web crawler built with Scrapy
I want to follow all the links on a web page that point to PDF files and store those PDF files on my system.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from bs4 import BeautifulSoup

class spider_a(BaseSpider):
    name = "Colleges"
    # allowed_domains takes bare domains, not URLs with a scheme
    allowed_domains = ["www.abc.org"]
    start_urls = [
        "http://www.abc.org/appwebsite.html",
        "http://www.abc.org/misappengineering.htm",
    ]

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link in soup.find_all('a'):
            download_link = link.get('href')
            # guard against <a> tags that have no href attribute
            if download_link and '.pdf' in download_link:
                pdf_url = "http://www.abc.org/" + download_link
                print pdf_url
With the code above, I can find the links on the pages where the PDF files are located.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class FileSpider(BaseSpider):
    name = "fspider"
    allowed_domains = ["www.aicte-india.org"]
    start_urls = [
        "http://www.abc.org/downloads/approved_institut_websites/an.pdf#toolbar=0&zoom=85"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-1]
        open(filename, 'wb').write(response.body)
With this code, I can save the body of the pages listed in start_urls.
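The filename logic above simply keeps everything after the last slash in the URL. A minimal sketch of that logic in plain Python (Python 3 stdlib; the URL below is illustrative) also shows why a #fragment like the one in start_urls would end up in the filename unless it is stripped first:

```python
from urllib.parse import urlsplit  # Python 2 equivalent: from urlparse import urlsplit

# Illustrative URL with a viewer fragment attached, as in start_urls above
url = "http://www.abc.org/downloads/an.pdf#toolbar=0&zoom=85"

# The spider's split("/")[-1] trick keeps the fragment too
print(url.split("/")[-1])                   # an.pdf#toolbar=0&zoom=85

# Stripping the fragment first yields a clean filename
print(urlsplit(url).path.split("/")[-1])    # an.pdf
```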
Is there a way to combine these two spiders, so that I can save the PDFs just by running one crawler?

Why do you need two spiders?
from urlparse import urljoin
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class spider_a(BaseSpider):
    ...

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # extract() turns the matched @href selectors into plain strings
        for href in hxs.select('//a/@href[contains(., ".pdf")]').extract():
            yield Request(urljoin(response.url, href),
                          callback=self.save_file)

    def save_file(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
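The urljoin call is what lets this single spider handle both relative and absolute hrefs. A small stdlib-only sketch (Python 3 module path; the PDF paths are hypothetical):

```python
from urllib.parse import urljoin  # Python 2 equivalent: from urlparse import urljoin

base = "http://www.abc.org/appwebsite.html"

# A relative href is resolved against the page the link was found on
print(urljoin(base, "downloads/an.pdf"))
# -> http://www.abc.org/downloads/an.pdf

# An already-absolute href passes through unchanged
print(urljoin(base, "http://www.abc.org/other.pdf"))
# -> http://www.abc.org/other.pdf
```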
Hi @steven, thanks for the help, but I'm getting this error: exceptions.AttributeError: 'HtmlXPathSelector' object has no attribute 'find'

That's because you need to use select, not find... and if you're using Scrapy, you don't need BeautifulSoup.