Python Scrapy spider: download all images from img src
I scraped some links from a website, and I am using a Scrapy spider to crawl them.
# image urls
look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li').extract_first()
for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)
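But I am not getting the values I expect. I only get the image link of a single li. I want to download all of the images in a loop.
Here is my HTML code: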
<div class="lookInsideDiv" style="display: block;">
<div class="exitBtn"><i class="ion-close-round"></i></div>
<div class="pagesArea">
<ul class="list-unstyled pages">
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
</ul>
</div>
</div>
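Try this. To extract all of the images, use extract(), which returns a list, instead of extract_first(), which returns only the first item. Also select the img/@src attribute rather than the li element.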
look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li/img/@src').extract()
for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)
Edit
It returns [], nothing: response.xpath('//ul[@class="list-unstyled pages"]/li/img/@src').extract()
Use this: x = response.xpath('//ul[@class="list-unstyled pages"]/li/img/@src').extract()
It returned [], nothing. You can check it.
Please provide the URL link.
from scrapy.selector import Selector

# Test the XPath offline against the HTML from the question
html = """<div class="lookInsideDiv" style="display: block;">
<div class="exitBtn"><i class="ion-close-round"></i></div>
<div class="pagesArea">
<ul class="list-unstyled pages">
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
<li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
</ul>
</div>
</div>"""
data = Selector(text=html)
# extract() returns every matching src attribute as a list of strings
look_inside_image_urls = data.xpath('//*/ul[@class="list-unstyled pages"]/li/img/@src').extract()
for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg