Download/export site search results using the export button in Python
So I am trying to fetch data from the following website (with a sample query) using Python. However, I realized it would be easier to programmatically click the site's save-results-as-"CSV" link and process the CSV data rather than scrape the search results, since that would save me from paging through all of the result pages.

I inspected the CSV link element and found that it calls an exportSearch('csv') function. By typing the function name into the browser console, I found that the CSV link simply sets window.location.href to: /export/format:csv/fulltext:NASA%20NOAA%20coral

If I follow that link in the same browser, a save prompt opens with the CSV to save.

My problem starts when I try to replicate this process with Python. If I call the export link directly with the requests library, the response is empty:
import urllib.parse
import requests

url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))  # prints 0 -- the body comes back empty
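As an aside on the %20 escapes in the export URL above: they are ordinary percent-encoding, which urllib.parse.quote produces. A minimal, network-free sketch:

```python
import urllib.parse

# quote() percent-encodes characters that are unsafe in a URL path;
# spaces become %20, matching what the site's exportSearch() builds.
term = "NASA NOAA coral"
encoded = urllib.parse.quote(term)
print(encoded)  # NASA%20NOAA%20coral

# unquote() reverses the encoding
assert urllib.parse.unquote(encoded) == term
```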
Can someone tell me what I'm missing? I don't know how to first establish the search results on the site's server and then export them with Python.


You can use the link below and download the file in Python with the urllib library:
# Include your search term
import urllib.parse
import urllib.request

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
urllib.request.urlretrieve(url)
# You can also specify where to save the file; note the second argument
# must be a file path, not a directory, e.g.:
# urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')
# Run the command below only if you hit an SSL certificate error;
# it generally occurs for Mac users with Python 3.6.
# In a Jupyter notebook:
!open /Applications/Python\ 3.6/Install\ Certificates.command
# On a terminal, just run:
open /Applications/Python\ 3.6/Install\ Certificates.command
Similarly, you can also fetch the file with the wget package:

import urllib.parse
import wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
wget.download(url)
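A side note on urllib.request.urlretrieve's semantics, since the destination argument is easy to get wrong: it takes a destination filename (not a directory) and returns the local path plus the response headers. A small network-free sketch using a file:// URL and hypothetical temp-file names:

```python
import tempfile
import urllib.request
from pathlib import Path

# Create a small local "remote" file to retrieve, so no network is needed.
src = Path(tempfile.gettempdir()) / "demo_source.csv"
src.write_text("title,year\nCoral study,2020\n")

# urlretrieve(url, filename): filename must be a file path, not a directory.
dest = str(Path(tempfile.gettempdir()) / "demo_copy.csv")
path, headers = urllib.request.urlretrieve(src.as_uri(), dest)

print(path)               # the destination path we supplied
print(open(path).read())  # contents of the retrieved file
```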
I believe the link to download the CSV is:

https://par.nsf.gov/export/format:csv//term:your_search_term

where your search term is URL-encoded. In your case, the link would be:

https://par.nsf.gov/export/format:csv//term:NASA%20NOAA%20coral


It turns out I was missing some cookies (e.g. WT_FPC) that are not set when you perform a simple requests GET.
To solve this, I performed an initial GET with selenium's webdriver, then set the cookies from that request on a requests session and used it for a POST request that downloads the CSV data:
import urllib.parse
import requests
from selenium import webdriver

chrome_path = "path to chrome driver"

with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    # Copy the webdriver's cookies into the requests session
    for c in browser.get_cookies():
        session.cookies.set(c['name'], c['value'])

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
    response = session.post(url)

    # No longer empty
    print(response.content.decode('utf-8'))
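Once response.content is non-empty, the CSV text can be processed directly in memory rather than saved to disk first. A minimal sketch with Python's csv module; the sample text and column names here are made up for illustration, the real export's columns will differ:

```python
import csv
import io

# Stand-in for response.content.decode('utf-8') -- illustrative data only.
csv_text = "Title,Year\nCoral bleaching survey,2019\nReef recovery,2021\n"

# DictReader maps each row to the header names, so columns can be
# accessed by name instead of position.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(len(rows))         # 2
print(rows[0]["Title"])  # Coral bleaching survey
```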