Downloading images from an HTTPS aspx page with Python
I am trying to download some images from the NASS Case Viewer. The link to the image viewer for my example case may not be visible, I assume because of https. However, the image in question is simply the second image from the front.

The actual link to the image is (or should be?) the GetBinary.aspx URL used in the code below. Opening it simply downloads the aspx binary.

My problem is that I do not know how to store this binary data in a proper jpg file.

A sample of the code I have tried:
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(str.encode(pull_image.text))
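But this does not produce a valid jpg file. I also checked pull_image.raw.read() and found that it is empty.

What could be the problem here? Is my URL incorrect? I assembled these URLs with BeautifulSoup and checked them by inspecting the HTML of several pages.

Am I saving the binary data incorrectly?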
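.text decodes the response content into a string, so your image file will be corrupted. Instead, you should use .content, which holds the binary response content: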
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(pull_image.content)
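To see why the .text round trip breaks the file, here is a minimal offline sketch. It uses a hand-written JPEG header rather than a real download, and it assumes ISO-8859-1 as the decoding that .text ends up applying; the encoding requests actually picks depends on the response headers and its detection heuristics.

```python
# First bytes of a typical JPEG file (start-of-image marker plus JFIF tag).
# This is hand-written sample data, not a real response from the NASS server.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Roughly what .text does: decode the raw bytes with some text encoding.
# ISO-8859-1 is an assumption made purely for illustration.
as_text = jpeg_header.decode("iso-8859-1")

# What the question's code then did: re-encode the string (UTF-8 by default).
round_tripped = str.encode(as_text)

print(jpeg_header)
print(round_tripped)
print(round_tripped == jpeg_header)
```

The re-encoded data is longer than the original and no longer starts with the JPEG marker bytes FF D8, which is why image viewers reject the saved file.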
.raw.read() also returns bytes, but to use it you must set the stream parameter to True:
pull_image = requests.get(test_image, stream=True)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(pull_image.raw.read())
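For larger files, a streamed response can also be written in chunks instead of being read in one call. The sketch below is one possible way to do that, not part of the original answer: write_chunks is a hypothetical helper, and the chunks in the demonstration are made-up bytes standing in for a response body so the example runs offline. With requests, the chunks would typically come from iter_content, as shown in the comment.

```python
import os
import tempfile
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write an iterable of byte chunks to path and return the bytes written."""
    total = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                total += len(chunk)
    return total


# With requests, the chunks would come from a streamed response, e.g.:
#     with requests.get(test_image, stream=True) as resp:
#         resp.raise_for_status()
#         write_chunks(resp.iter_content(chunk_size=8192), "test_image.jpg")

# Offline demonstration with made-up chunks (not real image data).
fake_chunks = [b"\xff\xd8\xff\xe0", b"", b"payload", b"\xff\xd9"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jpg")
written = write_chunks(fake_chunks, demo_path)
print(written, os.path.getsize(demo_path))
```

One difference worth knowing: iter_content decodes gzip or deflate transfer encodings for you, whereas raw.read() returns the bytes exactly as they arrived unless you pass decode_content=True.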
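I would like to follow up on @t.m.adam's answer and provide a complete answer for anyone interested in using this data for their own projects.

Below is my code for pulling all images for a sample of case IDs. It is fairly messy code, but I think it gives you what you need to get started.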
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

CaseIDs = [149006673, 149006651, 149006672, 149006673, 149006692, 149006693]
url_part1 = 'https://www-nass.nhtsa.dot.gov/nass/cds/'
data = []

with requests.Session() as sesh:
    for caseid in tqdm(CaseIDs):
        url_full = f"https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?ViewText&CaseID={caseid}&xsl=textonly.xsl&websrc=true"
        #print(url_full)
        source = sesh.get(url_full).text
        soup = BeautifulSoup(source, 'lxml')
        tr_tags = soup.find_all('tr', style="page-break-after: always")
        for tag in tr_tags:
            #print(tag)
            """
            try:
                vehicle = [x for x in tag.text.split('\n') if 'Vehicle' in x][0] ## return the first element
            except IndexError:
                vehicle = [x for x in tag.text.split('\n') if 'Scene' in x][0] ## return the first element
            """
            tag_list = tag.find_all('tr', class_='label')
            test = [x.find('td').text for x in tag_list]
            #print(test)
            img_id, img_type, part_name = test
            img_id = img_id.replace(":", "")
            img = tag.find('img')
            #part_name = img.get('alt').replace(":", "").replace("/", "")
            part_name = part_name.replace(":", "").replace("/", "")
            image_name = " ".join([img_type, part_name, img_id]) + ".jpg"
            url_src = img.get('src')
            img_url = url_part1 + url_src
            print(img_url)
            pull_image = sesh.get(img_url, stream=True)
            with open(image_name, "wb+") as myfile:
                myfile.write(pull_image.content)
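Two steps in the script above, building the image URL and cleaning the file name, could be pulled into small helpers. The following is only a sketch of that idea: build_image_url and safe_filename are hypothetical names, the label values in the demonstration are made up, and it assumes the src attribute is a relative path such as the GetBinary.aspx link from the question.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www-nass.nhtsa.dot.gov/nass/cds/"


def build_image_url(src: str, base: str = BASE_URL) -> str:
    """Resolve an <img src> value against the base URL."""
    return urljoin(base, src)


def safe_filename(*parts: str, ext: str = ".jpg") -> str:
    """Join label parts into a file name without characters that are
    invalid on common file systems."""
    name = " ".join(p.strip() for p in parts if p and p.strip())
    name = re.sub(r'[\\/:*?"<>|]', "", name)   # drop reserved characters
    name = re.sub(r"\s+", " ", name).strip()   # collapse runs of whitespace
    return name + ext


# Demonstration with the image link from the question and made-up labels.
print(build_image_url(
    "GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"))
print(safe_filename("Vehicle 1", "Front/Left", "497001669:"))
```

Using urljoin instead of string concatenation also copes with src values that are already absolute, and stripping the wider set of reserved characters avoids failures when a label contains something other than ":" or "/".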