Python: how do I scrape images when there is no 'img' tag?

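Recently I have been learning web scraping so that I can download all the pictures from my school's staff directory. However, the elements there do not store the images in img tags. Instead, each one is set through a style like this: background-image:url("/common/pages/GalleryPhoto.aspx?photoId=323070&width=180&height=180")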

Is there any way to get around this?

Below is my current code for downloading the images from the target site:

import os, requests, bs4, webbrowser, random
 
url = 'https://jhs.lsc.k12.in.us/staff_directory' 
  
res = requests.get(url)
try: 
    res.raise_for_status() 
except Exception as exc: 
    print('Sorry an error occurred:', exc)
 
soup = bs4.BeautifulSoup(res.text, 'html.parser') 
element = soup.select('background-image') 
 
for i in range(len(element)): 
    url = element[i].get('img') 
    name = random.randrange(1, 25) 
    file = open(str(name) + '.jpg', 'wb') 
    res = requests.get(url) 
    for chunk in res.iter_content(10000): 
        file.write(chunk) 
    file.close() 
 
print('done')

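You can use the internal API that the site itself uses to get the data, including the image URLs. The site first calls the /Settings endpoint to get the list of staff groups, then calls the /Search API with all the group IDs.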

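The flow looks like this:

  • Read the portlet instance ID from the div that has the data-portlet-instance-id attribute

  • Call the Settings API and get the group IDs: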

    POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings
    
  • Call the Search API with pagination parameters; you can choose how many people to request and how many per page:

    POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search
    
The following script gets the first 20 people and puts the results into a pandas DataFrame:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://jhs.lsc.k12.in.us/staff_directory")
soup = BeautifulSoup(r.content, "lxml")

portletInstanceId = soup.select('div[data-portlet-instance-id].staffDirectoryComponent')[0]["data-portlet-instance-id"]

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings",
    json = { "portletInstanceId": portletInstanceId })

groupIds = [t["groupID"] for t in r.json()["d"]["groups"]]
print(groupIds)

payload = {
    "firstRecord": 0,
    "groupIds": groupIds,
    "lastRecord": 20,
    "portletInstanceId": portletInstanceId,
    "searchByJobTitle": True,
    "searchTerm": "",
    "sortOrder": "LastName,FirstName ASC"
}

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search",
    json = payload)

results = r.json()["d"]["results"]

# turn the relative imageURL field into an absolute URL (empty if missing)
for t in results:
    t["imageURL"] = f'https://jhs.lsc.k12.in.us/{t["imageURL"]}' if t["imageURL"] else ''
 
df = pd.DataFrame(results)

#whole data
print(df)

#only image url
with pd.option_context('display.max_colwidth', 400):
    print(df["imageURL"])


To get more people, update the firstRecord and lastRecord fields accordingly.
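
A rough sketch of paging through every record by stepping firstRecord and lastRecord. The names fetch_all and fetch_page are made up for illustration; fetch_page stands for any function that sends the Search payload above with the given window and returns the list under ["d"]["results"]. It assumes the API returns an empty or short list once the records run out, and that lastRecord is exclusive (which "0 to 20 gets the first 20" suggests, but I have not verified it):

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]],
              page_size: int = 20,
              max_pages: int = 100) -> List[dict]:
    """Collect results window by window until a page comes back short or empty.

    fetch_page(first, last) is a hypothetical callable that performs the
    Search request with firstRecord=first and lastRecord=last.
    """
    collected = []
    for page in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        first = page * page_size
        batch = fetch_page(first, first + page_size)
        if not batch:
            break
        collected.extend(batch)
        if len(batch) < page_size:  # a short page is taken to mean the last page
            break
    return collected
```

With the script above, fetch_page could be a small function that copies payload, sets the two record fields, posts it to the Search URL and returns r.json()["d"]["results"]. If lastRecord turns out to be inclusive, neighbouring windows would overlap by one record and the step would need adjusting.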

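You need to look at the style attribute of those tags and parse the image URL out of it. With this approach you will only get the first 10 images. You would be better off using a proper web scraping tool (wget, SiteSucker…).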
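
A minimal sketch of the style-attribute approach, using only the standard library so it runs without extra packages (the same regex would work on the style values you get from BeautifulSoup). The names BG_URL, BackgroundImageParser and background_image_urls are made up for illustration, and the regex assumes the simple background-image:url(...) form shown in the question:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches background-image:url(...) with optional quotes around the URL.
BG_URL = re.compile(r"background-image\s*:\s*url\(\s*['\"]?(.*?)['\"]?\s*\)",
                    re.IGNORECASE)


class BackgroundImageParser(HTMLParser):
    """Collects absolute URLs found in inline background-image styles."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive already unescaped (&amp; becomes &).
        style = dict(attrs).get("style") or ""
        match = BG_URL.search(style)
        if match:
            self.urls.append(urljoin(self.base_url, match.group(1)))


def background_image_urls(html, base_url):
    """Return every inline background-image URL in html, resolved against base_url."""
    parser = BackgroundImageParser(base_url)
    parser.feed(html)
    return parser.urls
```

Calling background_image_urls(res.text, url) on the directory page would then give a list of absolute URLs to pass to requests.get. It only sees elements present in the downloaded HTML, so the limit on the number of images mentioned above still applies.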