Python: how do I scrape images when there is no 'img' tag?
Recently I have been learning web scraping so that I can download all the pictures from my school's staff directory. However, the elements on that page do not store the images in img tags. Instead, each image is set like this: background-image: url("/common/pages/GalleryPhoto.aspx?photoId=323070&width=180&height=180"). Is there any way around this? Below is my current code, which can download images from a target website:
import os, requests, bs4, webbrowser, random

url = 'https://jhs.lsc.k12.in.us/staff_directory'
res = requests.get(url)
try:
    res.raise_for_status()
except Exception as exc:
    print('Sorry an error occurred:', exc)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
element = soup.select('background-image')
for i in range(len(element)):
    url = element[i].get('img')
    name = random.randrange(1, 25)
    file = open(str(name) + '.jpg', 'wb')
    res = requests.get(url)
    for chunk in res.iter_content(10000):
        file.write(chunk)
    file.close()
print('done')
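You can use the internal API that the site itself uses to fetch the data, including the image URLs. The site first calls the /Settings endpoint to get the list of staff groups, then calls the /Search API with all the group IDs.
The flow is as follows:
- Get the portletInstanceId from the div tag that has the attribute data-portlet-instance-id
- Call the Settings API and get the group IDs:
  POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings
- Call the Search API with pagination parameters; you can choose how many people to request and how many per page:
  POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search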
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://jhs.lsc.k12.in.us/staff_directory")
soup = BeautifulSoup(r.content, "lxml")
portletInstanceId = soup.select('div[data-portlet-instance-id].staffDirectoryComponent')[0]["data-portlet-instance-id"]

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings",
    json = { "portletInstanceId": portletInstanceId })
groupIds = [t["groupID"] for t in r.json()["d"]["groups"]]
print(groupIds)

payload = {
    "firstRecord": 0,
    "groupIds": groupIds,
    "lastRecord": 20,
    "portletInstanceId": portletInstanceId,
    "searchByJobTitle": True,
    "searchTerm": "",
    "sortOrder": "LastName,FirstName ASC"
}
r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search",
    json = payload)
results = r.json()["d"]["results"]

# add image url based on userID
for t in results:
    t["imageURL"] = f'https://jhs.lsc.k12.in.us/{t["imageURL"]}' if t["imageURL"] else ''

df = pd.DataFrame(results)
# whole data
print(df)
# only image url
with pd.option_context('display.max_colwidth', 400):
    print(df["imageURL"])
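
The Search request above returns only the records between firstRecord and lastRecord. As a minimal sketch of walking through the whole directory, the loop below advances the window by however many records each page returned and stops at the first empty page. The helper names make_search_page and fetch_all, the page_size and max_pages parameters, and the assumption that the endpoint returns an empty results list past the end are illustrative guesses, not something confirmed by the site.

```python
def make_search_page(post, url, base_payload):
    """Build a function that fetches one page of results.

    `post` is expected to behave like requests.post (hypothetical wiring);
    `base_payload` is the Search payload without the record window.
    """
    def search_page(first, last):
        # Copy the payload and overwrite only the pagination fields.
        payload = dict(base_payload, firstRecord=first, lastRecord=last)
        return post(url, json=payload).json()["d"]["results"]
    return search_page


def fetch_all(search_page, page_size=20, max_pages=1000):
    """Collect records page by page until an empty page comes back.

    Assumption: requesting a window past the last record yields an
    empty list. max_pages is a safety cap in case that does not hold.
    """
    records = []
    first = 0
    for _ in range(max_pages):
        batch = search_page(first, first + page_size)
        if not batch:
            break
        records.extend(batch)
        # Advance by what was actually returned, so the loop works whether
        # the server treats lastRecord as inclusive or exclusive.
        first += len(batch)
    return records
```

With the variables from the code above, this could be called as fetch_all(make_search_page(requests.post, search_url, payload)), where search_url is the Search endpoint URL. Passing requests.post in as an argument keeps the loop easy to try against a stub before hitting the real server.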
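To fetch more records, update the firstRecord and lastRecord fields accordingly.

You need to look at the style attribute of these tags and parse the image URL out of it.

With this approach you will only get the first 10 images. You would be better off using a proper web scraping tool (wget, SiteSucker…).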
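
As a rough sketch of the style-attribute approach, the code below pulls the url(...) value out of any inline background-image declaration and resolves it against the page URL. It uses only the standard library (html.parser, re, urllib.parse) rather than BeautifulSoup so that it runs on its own. The names BG_URL_RE, background_url, StyleCollector and background_urls are made up for illustration, and the sketch assumes the images are set through inline style attributes in the HTML that requests receives, not through a stylesheet or JavaScript.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches background-image: url(...) with optional single or double quotes.
BG_URL_RE = re.compile(
    r"background-image\s*:\s*url\(\s*(['\"]?)(.*?)\1\s*\)",
    re.IGNORECASE,
)


def background_url(style, base_url):
    """Return the absolute image URL found in a style string, or None."""
    match = BG_URL_RE.search(style or "")
    if not match:
        return None
    # Relative paths such as /common/pages/... are resolved against the page.
    return urljoin(base_url, match.group(2))


class StyleCollector(HTMLParser):
    """Collects background-image URLs from every tag's style attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        url = background_url(dict(attrs).get("style"), self.base_url)
        if url:
            self.urls.append(url)


def background_urls(html, base_url):
    """Parse an HTML string and return all background-image URLs in order."""
    collector = StyleCollector(base_url)
    collector.feed(html)
    return collector.urls
```

With BeautifulSoup, the same style strings should be reachable through soup.select('[style*="background-image"]') and tag.get('style'), which can then be passed to background_url. If the photos are injected by JavaScript after the page loads, the downloaded HTML will not contain them and this approach will find nothing.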