Python BeautifulSoup: getting an attribute from every img inside a div with a specific id

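I am trying to get the relative links stored in the image-file attribute of the img tags that sit under the div with id previewImages. I do not want the src links.

Here is the sample HTML: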

<div id="previewImages">
  <div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>

How can I get all the links in the image-file attribute of the img tags inside the div with id previewImages?

Use .findAll (spelled find_all in current versions of BeautifulSoup).

Example:
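The code for this example did not survive on this page, so the following is only a minimal sketch of what such a lookup might look like, not the answerer's original snippet. It assumes the sample HTML from the question is held in a string named html (a name chosen here for illustration) and uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = """
<div id="previewImages">
  <div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the container by id, then read image-file from each img below it
container = soup.find("div", id="previewImages")
links = [img["image-file"] for img in container.find_all("img")]

# Print one relative link per line
for link in links:
    print(link)
```

With the sample markup this prints the five relative paths, from /image/15.jpg to /image/4.jpg, one per line.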

BeautifulSoup has a .find_all() method (see the documentation). Here is how you can use it in your code:

import sys
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

quote_page = sys.argv[1]  # the page URL is the first command-line argument
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

# Locate the container by id, then read image-file from every img inside it
images_box = soup.find('div', attrs={'id': 'previewImages'})
links = [img['image-file'] for img in images_box.find_all('img')]

print(links)
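The snippet above assumes that the previewImages div exists and that every img inside it carries an image-file attribute. On a real page either assumption may fail: find returns None when nothing matches, and indexing a missing attribute raises KeyError. The following is a hedged sketch of a more defensive variant. The function name image_file_links and its container_id parameter are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def image_file_links(html, container_id="previewImages"):
    """Return the image-file values of img tags inside the given div.

    Hypothetical helper: returns an empty list when the container is
    missing, and skips img tags that have no image-file attribute.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        return []
    # attrs={"image-file": True} matches only tags where the attribute is present
    imgs = container.find_all("img", attrs={"image-file": True})
    return [img["image-file"] for img in imgs]


# One img without the attribute, and one img outside the container
sample = (
    '<div id="previewImages">'
    '<img src="a.jpg" image-file="/image/a.jpg">'
    '<img src="b.jpg">'
    '</div>'
    '<img src="c.jpg" image-file="/image/outside.jpg">'
)
print(image_file_links(sample))
print(image_file_links("<p>no container here</p>"))
```

Only the first img should be reported: the second lacks the attribute and the third lies outside the div.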

The same scenario using lxml:

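import lxml.html

# sample is assumed to hold the HTML from the question as a string
tree = lxml.html.fromstring(sample)
# Restrict the match to img tags under the div with id previewImages
images = tree.xpath("//div[@id='previewImages']//img/@image-file")
print(images)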

Output:

['/image/15.jpg', '/image/2.jpg', '/image/0.jpg', '/image/3.jpg', '/image/4.jpg']

I think it is faster to use the id when passing an attribute selector to select:

from bs4 import BeautifulSoup as bs
html = '''
<div id="previewImages">
  <div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
  <div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
'''
soup = bs(html, 'lxml')
links = [item['image-file'] for item in soup.select('#previewImages [image-file]')]
print(links)