Download/export site search results using the export button in Python


So I am trying to pull data from the following website (with an example query) using Python:

However, I realized it would be easier to programmatically click the results' Save as "CSV" link and process the CSV data instead of scraping the search results, since that would save me from paging through every page of results.

I inspected the CSV link element and found that it calls a function named exportSearch('CSV').

By typing the function name into the console, I found that the CSV link simply sets window.location.href to: export/format:csv/fulltext:NASA%20NOAA%20coral. If I follow that link in the same browser, a save prompt opens with the CSV file to be saved.
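That console finding can be sketched in Python; the path format below is just what the console showed, and the URL assembly itself is my own illustration:

```python
import urllib.parse

# The exportSearch('CSV') handler appears to navigate to a URL of this form
base = "https://par.nsf.gov/export/format:csv/fulltext:"
term = "NASA NOAA coral"

# urllib.parse.quote percent-encodes the spaces, matching the console output
export_url = base + urllib.parse.quote(term)
print(export_url)  # ...fulltext:NASA%20NOAA%20coral
```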

My problem starts when I try to replicate this process in Python. If I call the export link directly with the requests library, the response is empty:

import urllib.parse

import requests

url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))

Can anyone tell me what I am missing? I don't know how to first establish the search results on the site's server and then perform the export with Python.

You can use the link below to download the file in Python with the urllib library:

# Include your search term
import urllib.parse
import urllib.request

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)

urllib.request.urlretrieve(url)

# You can also specify where to download the file, e.g.
# (the destination must be a file path, not a directory):
urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')


# Run this command only if you face an SSL certificate error;
# it generally occurs for Mac users with Python 3.6.
# In a Jupyter notebook run:
!open /Applications/Python\ 3.6/Install\ Certificates.command

# In a terminal just run:
open /Applications/Python\ 3.6/Install\ Certificates.command
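As an alternative to running the Install Certificates command, an SSL context can be passed to urlopen explicitly. This is a sketch, not part of the original answer, and the fetch helper name is mine:

```python
import ssl
import urllib.request

# A default context uses the system's trusted certificates; if verification
# still fails, a context built from the certifi bundle (pip install certifi)
# could be substituted here instead.
context = ssl.create_default_context()

def fetch(url):
    # Pass the context explicitly so urlopen verifies against it
    with urllib.request.urlopen(url, context=context) as resp:
        return resp.read()
```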

Similarly, you can also fetch the file using wget:

import urllib.parse

import wget  # third-party package: pip install wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)

wget.download(url)


I believe the link to download the CSV looks like this:

https://par.nsf.gov/export/format:csv//term:your_search_term

where your_search_term is URL-encoded.

In your case, the link is:
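A minimal sketch of building such a link for the question's search term (using this answer's term: path form):

```python
import urllib.parse

# Append the URL-encoded search term to the term: path segment
link = ("https://par.nsf.gov/export/format:csv//term:"
        + urllib.parse.quote("NASA NOAA coral"))
print(link)  # ends with term:NASA%20NOAA%20coral
```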


It turns out I was missing some cookies that are not present when you perform a simple requests GET (e.g. WT_FPC).

To solve this, I performed an initial GET request using selenium's webdriver, then set the cookies from that request on a POST request to download the CSV data:

import urllib.parse

import requests
from selenium import webdriver

chrome_path = "path to chrome driver"
with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    # Copy the webdriver's cookies into the requests session
    for c in browser.get_cookies():
        session.cookies.set(c['name'], c['value'])

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    response = session.post(url)
    # No longer empty
    print(response.content.decode('utf-8'))
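The cookie-copying step can also be factored into a small helper; the function name is hypothetical, and the cookie value below is made up:

```python
# Hypothetical helper: convert selenium's list-of-dicts cookie format
# into the plain {name: value} mapping that requests accepts via cookies=...
def selenium_cookies_to_dict(cookie_list):
    return {c['name']: c['value'] for c in cookie_list}

# Example with a made-up cookie of the kind the site sets (e.g. WT_FPC)
cookies = selenium_cookies_to_dict([{'name': 'WT_FPC', 'value': 'id=abc123'}])
print(cookies)  # {'WT_FPC': 'id=abc123'}
```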
