Python: how to scrape the download links inside every folder on S3
I have been scraping this dynamic website, which is basically an index of links. I want to get the download links for all the files in every folder, down to the last subfolder. I don't know what mechanism I should use for that.

Code:
import time
import lxml
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'http://dl.ncsbe.gov.s3.amazonaws.com/index.html?prefix='
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(5)  # wait for the JavaScript-rendered index to load
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
link_list = []
for tags in soup.find_all('a'):
    links = tags['href']
    link_list.append(links)

req = requests.get('https://s3.amazonaws.com/dl.ncsbe.gov?delimiter=/').content  # from the Network tab in the F12 dev tools
soup = BeautifulSoup(req, 'lxml')
names = []
for common in soup.find_all('prefix')[2:]:
    names.append(common.text)
names.sort()
print(names)
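I just want to get the download links for every file type in every folder.

This is a public S3 bucket, so you can get the XML listing from the root folder:
https://s3.amazonaws.com/dl.ncsbe.gov/
This means you can fetch it as a response, parse the XML, and rebuild the URLs for all the keys.
Here's how: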
import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"
# The bucket root returns an XML listing of object keys.
data = xmltodict.parse(requests.get(base_url).content)
valid_extensions = (
    ".pdf", ".doc", ".docx", ".txt", ".zip", ".xlsx", ".xls", ".csv", ".mp4",
)
for item in data["ListBucketResult"]["Contents"]:
    if item["Key"].endswith(valid_extensions):
        # Join the base URL and the key with exactly one slash.
        s3_url = base_url + "/" if not item["Key"].startswith("/") else base_url
        print(f'{s3_url}{item["Key"].replace(" ", "%20")}')
This prints the structure of the S3 bucket as file URLs:
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/2018%20County%20CF%20Procedures%20After%20the%20Election%20New%20Election%20Cycle%20Tasks.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Checklist.doc
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Letter%20-%20standard.docx
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-201%20Delinquent%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-202%20Late%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-203%20Noncompliant%20Comms.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Prohibited%20Receipts-Expenditures.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-201.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-202.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-203.pdf
and many more ...
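
One caveat worth hedging: a single S3 listing request returns at most 1,000 keys, so for a large bucket the loop above may stop short of the last subfolder. Below is a minimal sketch of how paging could be handled, assuming the bucket also allows anonymous ListObjectsV2 requests (the `list-type=2` and `continuation-token` query parameters). It uses only the standard library instead of requests and xmltodict, and the helper names (`parse_listing`, `fetch_page`, `iter_keys`, `key_to_url`, `group_by_folder`) are made up for illustration. It also groups the URLs by folder, since that is what the question asks for.

```python
import posixpath
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from collections import defaultdict

BASE_URL = "https://s3.amazonaws.com/dl.ncsbe.gov"  # bucket URL from the answer
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # XML namespace of S3 listings


def parse_listing(xml_bytes):
    """Return (keys, next_token) for one page of listing XML."""
    root = ET.fromstring(xml_bytes)
    keys = [el.findtext(f"{S3_NS}Key") for el in root.findall(f"{S3_NS}Contents")]
    truncated = root.findtext(f"{S3_NS}IsTruncated") == "true"
    token = root.findtext(f"{S3_NS}NextContinuationToken") if truncated else None
    return keys, token


def fetch_page(token=None):
    """Fetch one listing page (assumption: anonymous listing is permitted)."""
    params = {"list-type": "2"}
    if token:
        params["continuation-token"] = token
    url = f"{BASE_URL}/?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def iter_keys(fetch=fetch_page):
    """Yield every key, following continuation tokens until the listing ends."""
    token = None
    while True:
        keys, token = parse_listing(fetch(token))
        yield from keys
        if token is None:
            break


def key_to_url(key):
    """Percent-encode a key (slashes are kept) and append it to the bucket URL."""
    return f"{BASE_URL}/{urllib.parse.quote(key.lstrip('/'))}"


def group_by_folder(keys):
    """Map each folder path to the download URLs of the files directly in it."""
    folders = defaultdict(list)
    for key in keys:
        if key.endswith("/"):
            continue  # skip "folder" placeholder objects
        folders[posixpath.dirname(key)].append(key_to_url(key))
    return dict(folders)
```

Calling `group_by_folder(iter_keys())` would then walk the whole bucket and return a dict keyed by folder path, with files at the bucket root under the empty string. The extension filter from the answer can be applied to the keys before grouping if only certain file types are wanted.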
Try this link