Download/export site search results using the export button in Python
So I am trying to fetch data from the following website (with a sample query) using Python. However, I realized it would be easier to programmatically click the site's save-results-as-"CSV" link and process the CSV data rather than scrape the search results, since that would save me from paging through all of the result pages.

I inspected the CSV link element and found that it calls an exportSearch('csv') function. By typing the function name into the browser console, I found that the CSV link simply sets window.location.href to: /export/format:csv/fulltext:NASA%20NOAA%20coral

If I follow that link in the same browser, a save prompt opens with the CSV to save.

My problem starts when I try to replicate this process with Python. If I call the export link directly with the requests library, the response is empty:
import urllib.parse
import requests

url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))  # prints 0 -- the body comes back empty
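As an aside on the %20 escapes in the export URL above: they are ordinary percent-encoding, which urllib.parse.quote produces. A minimal, network-free sketch:

```python
import urllib.parse

# quote() percent-encodes characters that are unsafe in a URL path;
# spaces become %20, matching what the site's exportSearch() builds.
term = "NASA NOAA coral"
encoded = urllib.parse.quote(term)
print(encoded)  # NASA%20NOAA%20coral

# unquote() reverses the encoding
assert urllib.parse.unquote(encoded) == term
```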
Can someone tell me what I'm missing? I don't know how to first establish the search results on the site's server and then export them with Python.


You can use the link below and download the file in Python with the urllib library:
# Include your search term
import urllib.parse
import urllib.request

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
urllib.request.urlretrieve(url)
# You can also specify where to save the file; note the second argument
# must be a file path, not a directory, e.g.:
# urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')
# Run the command below only if you hit an SSL certificate error;
# it generally occurs for Mac users with Python 3.6.
# In a Jupyter notebook:
!open /Applications/Python\ 3.6/Install\ Certificates.command
# On a terminal, just run:
open /Applications/Python\ 3.6/Install\ Certificates.command
Similarly, you can also fetch the file with the wget package:

import urllib.parse
import wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
wget.download(url)
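A side note on urllib.request.urlretrieve's semantics, since the destination argument is easy to get wrong: it takes a destination filename (not a directory) and returns the local path plus the response headers. A small network-free sketch using a file:// URL and hypothetical temp-file names:

```python
import tempfile
import urllib.request
from pathlib import Path

# Create a small local "remote" file to retrieve, so no network is needed.
src = Path(tempfile.gettempdir()) / "demo_source.csv"
src.write_text("title,year\nCoral study,2020\n")

# urlretrieve(url, filename): filename must be a file path, not a directory.
dest = str(Path(tempfile.gettempdir()) / "demo_copy.csv")
path, headers = urllib.request.urlretrieve(src.as_uri(), dest)

print(path)               # the destination path we supplied
print(open(path).read())  # contents of the retrieved file
```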
I believe the link to download the CSV is:

https://par.nsf.gov/export/format:csv//term:your_search_term

where your search term is URL-encoded. In your case, the link would be:

https://par.nsf.gov/export/format:csv//term:NASA%20NOAA%20coral


It turns out I was missing some cookies (e.g. WT_FPC) that are not set when you perform a simple requests GET.
To solve this, I performed an initial GET with selenium's webdriver, then set the cookies from that request on a requests session and used it for a POST request that downloads the CSV data:
import urllib.parse
import requests
from selenium import webdriver

chrome_path = "path to chrome driver"

with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    # Copy the webdriver's cookies into the requests session
    for c in browser.get_cookies():
        session.cookies.set(c['name'], c['value'])

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
    response = session.post(url)

    # No longer empty
    print(response.content.decode('utf-8'))
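Once response.content is non-empty, the CSV text can be processed directly in memory rather than saved to disk first. A minimal sketch with Python's csv module; the sample text and column names here are made up for illustration, the real export's columns will differ:

```python
import csv
import io

# Stand-in for response.content.decode('utf-8') -- illustrative data only.
csv_text = "Title,Year\nCoral bleaching survey,2019\nReef recovery,2021\n"

# DictReader maps each row to the header names, so columns can be
# accessed by name instead of position.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(len(rows))         # 2
print(rows[0]["Title"])  # Coral bleaching survey
```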