
Python: how to scrape the download links inside each folder on S3


I've been scraping this dynamic website, which is basically an index of links. I want to get all of the download links for the files inside each folder, down to the last subfolder, but I don't know what mechanism I should use to do that.

Code:

import time
import lxml
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'http://dl.ncsbe.gov.s3.amazonaws.com/index.html?prefix='

# render the JavaScript index page with headless Chrome
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(5)
page = driver.page_source
driver.quit()

# collect every anchor href from the rendered page
soup = BeautifulSoup(page, 'html.parser')
lists = []
for tags in soup.find_all('a'):
    links = tags['href']
    lists.append(links)

# list the bucket's top-level prefixes via the S3 XML API
req = requests.get('https://s3.amazonaws.com/dl.ncsbe.gov?delimiter=/').content  # from the Network tab in F12
soup = BeautifulSoup(req, 'lxml')
names = []
for common in soup.find_all('prefix')[2:]:
    names.append(common.text)
names.sort()
print(names)

I just want to get the download links for every file type inside each folder.

This is a public S3 bucket, so you can fetch the bucket listing XML straight from the root URL:

https://s3.amazonaws.com/dl.ncsbe.gov/

That means you can take that response, parse the XML, and rebuild the URLs for every key.

Here's how:

import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"
data = xmltodict.parse(requests.get(base_url).content)

valid_extensions = (
    ".pdf", ".doc", ".docx", ".txt", ".zip", ".xlsx", "xls", ".csv", ".mp4",
)

for item in data["ListBucketResult"]["Contents"]:
    if item["Key"].endswith(valid_extensions):
        s3_url = base_url + "/" if not item["Key"].startswith("/") else base_url
        print(f'{s3_url}{item["Key"].replace(" ", "%20")}')
This will output the entire structure of the bucket as file URLs:

https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/2018%20County%20CF%20Procedures%20After%20the%20Election%20New%20Election%20Cycle%20Tasks.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Checklist.doc
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Letter%20-%20standard.docx
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-201%20Delinquent%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-202%20Late%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-203%20Noncompliant%20Comms.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Prohibited%20Receipts-Expenditures.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-201.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-202.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-203.pdf

and many more ...
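One caveat: the S3 ListObjects API caps each response at 1,000 keys. If the bucket ever holds more than that, the response is truncated and you have to page through it with the marker parameter. Here is a rough sketch of that loop (the page_keys helper is just an illustration, not part of the code above), reusing the same requests + xmltodict approach:

import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"

def page_keys(base_url):
    # ListObjects returns at most 1000 keys per response; follow the marker
    # until IsTruncated is no longer "true"
    marker = ""
    while True:
        resp = requests.get(base_url, params={"marker": marker})
        result = xmltodict.parse(resp.content)["ListBucketResult"]
        contents = result.get("Contents", [])
        # xmltodict gives a single dict instead of a list when only one <Contents> is present
        if isinstance(contents, dict):
            contents = [contents]
        for item in contents:
            yield item["Key"]
        if result.get("IsTruncated") != "true":
            break
        # the next page starts after the last key of the current one
        marker = contents[-1]["Key"]

for key in page_keys(base_url):
    print(f'{base_url}/{key.replace(" ", "%20")}')

Each page's last key becomes the marker for the next request, so the whole bucket gets covered even past the 1,000-key limit.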