Python 3.x: Scraping the Google Scholar website with Selenium

python-3.x selenium selenium-webdriver

I am trying to scrape the citations, h-index, and i10-index from a Google Scholar profile page and store them in a pandas DataFrame using Selenium WebDriver.

# install chromium, its driver, and selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
wd = webdriver.Chrome('chromedriver',options=options)

wd.get("https://scholar.google.com/citations?user=kukA0LcAAAAJ&hl=en&oi=ao")
divs = wd.find_elements_by_class_name('gsc_rsb')

for i in divs[0].find_elements_by_tag_name('a'):
  #print(i)
  print(i.get_attribute('text'))

The result is as follows:

Get my own profile
Citations
h-index
i10-index
1051
1163
1365
1762
2001
2707
4192
7293
13372
25177
40711
65915
87193
101992
21846
View all
Aaron Courville
Pascal Vincent
Kyunghyun Cho
Ian Goodfellow
Yann LeCun
Hugo Larochelle
Caglar Gulcehre
Dzmitry Bahdanau
David Warde-Farley
Xavier Glorot
Razvan Pascanu
Leon Bottou
Sherjil Ozair
Mehdi Mirza
James Bergstra
Olivier Delalleau
Anirudh Goyal
Pascal Lamblin
Patrick Haffner
Nicolas Le Roux

But I only need the citations, h-index, and i10-index, as in the following data frame:

|  Name       |   Citations(All)   | Citations(since2016) | i10-index|i10-index(since2016)|
+-------------+--------------------+----------------------+----------+--------------------+
|Yoshua Bengio|   387118           |   343301             |  181     |        164         |

How can I achieve this with the code above?

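In the question as posted, divs = wd.find_elements_by_class_name('gsc_rsb' is missing its closing parenthesis. Beyond that, you can combine the page source with the select()/select_one() methods to look elements up quickly by CSS selector.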

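Note that even with Selenium or requests-html, Google Scholar may still throw a CAPTCHA. If custom headers don't help, the first thing you can do is add proxies to your requests.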
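As an illustration of the headers-plus-proxy idea, here is a minimal sketch using only the standard library's urllib. The proxy address and User-Agent string are hypothetical placeholders, and make_opener and fetch are names made up for this example; nothing here guarantees that a CAPTCHA will be avoided.

```python
import urllib.request

# Hypothetical placeholder values: substitute a real proxy and a current
# browser User-Agent string before using this for anything.
PROXY_URL = "http://user:password@proxy.example.com:8080"
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_opener(proxy_url: str = PROXY_URL,
                user_agent: str = USER_AGENT) -> urllib.request.OpenerDirector:
    """Build an opener that routes traffic through a proxy and sends custom headers."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(proxy_handler)
    # addheaders replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [
        ("User-Agent", user_agent),
        ("Accept-Language", "en-US,en;q=0.9"),
    ]
    return opener


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a page through the proxy-aware opener and return its text."""
    opener = make_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With headless Chrome driven by Selenium, the rough equivalent is to pass Chromium's --proxy-server switch through options.add_argument before creating the driver.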

Here is what I came up with; it also worked for me on other profiles:

Output:

            Name Citations Citations 2016 h-index h-index 2016 i10-index  \
0  Yoshua Bengio    387118         343301     181          164       625   

  i10-index 2016  
0            55
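For readers who want something concrete, the following is a rough sketch of the CSS-selector idea and not the answer's original code. It swaps in Selenium's own By.CSS_SELECTOR lookup instead of select()/select_one(), and it assumes Selenium 4, a Chrome driver available on the PATH, and that the six statistics cells carry the class gsc_rsb_std inside the table with id gsc_rsb_st, in the order all/since for citations, h-index, and i10-index. The helper names build_row and scrape_profile are hypothetical.

```python
from typing import Dict, List

# Column order assumed from the profile's stats table: each metric has an
# "all time" cell followed by a "since 2016" cell.
COLUMNS = [
    "Citations", "Citations 2016",
    "h-index", "h-index 2016",
    "i10-index", "i10-index 2016",
]


def build_row(name: str, cell_texts: List[str]) -> Dict[str, List[str]]:
    """Map the author name and the six stats cells onto named columns."""
    if len(cell_texts) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} stats cells, got {len(cell_texts)}"
        )
    row = {"Name": [name]}
    for column, text in zip(COLUMNS, cell_texts):
        row[column] = [text]
    return row


def scrape_profile(url: str) -> Dict[str, List[str]]:
    """Open the profile in headless Chrome and read the stats table."""
    # Imported lazily so build_row() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        name = driver.find_element(By.CSS_SELECTOR, "#gsc_prf_in").text
        cells = driver.find_elements(By.CSS_SELECTOR, "#gsc_rsb_st td.gsc_rsb_std")
        return build_row(name, [cell.text for cell in cells])
    finally:
        # Always release the browser, even if a selector fails.
        driver.quit()
```

Passing the returned dictionary to pd.DataFrame(...) would then give a one-row frame with the column names used in the output above. If Google serves a CAPTCHA page instead of the profile, find_element raises an exception rather than returning partial data.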

Another solution, using requests-html

Code (the full listing appears at the end of this page):

Output (I guess PyCharm doesn't like long column names in the console, so it replaces them with "...", but the values are there):

            Name Citations Citations 2016  ... h-index 2016 i10-index i10-index 2016
0  Yoshua Bengio    388599         344787  ...          164       627            552

Alternatively, you can use SerpApi. It is a paid API with a free trial of 5,000 searches. See its documentation to try it out.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

# Each row of the cited_by table holds one metric with "all" and "since_2016" values
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}\n')

Output:

67595
28238
110
63
966
448

https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=9PepYk8AAAAJ
7

Disclaimer: I work for SerpApi.
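To get from that nested response to a single DataFrame-ready row, one option is to flatten the cited_by table in a loop instead of indexing each cell by hand. This is only a sketch: citation_stats is a hypothetical helper, and the sample dictionary is hand-built to mirror the structure and numbers shown above (including the since_2016 key), not an actual API response.

```python
from typing import Any, Dict


def citation_stats(results: Dict[str, Any]) -> Dict[str, int]:
    """Flatten the cited_by table of an author response into one flat row."""
    row: Dict[str, int] = {}
    for entry in results["cited_by"]["table"]:
        # Each entry is assumed to hold a single metric, for example
        # {"citations": {"all": 67595, "since_2016": 28238}}.
        for metric, values in entry.items():
            for period, value in values.items():
                row[f"{metric}_{period}"] = value
    return row


# Hand-built sample mirroring the structure used in the answer above.
sample = {
    "cited_by": {
        "table": [
            {"citations": {"all": 67595, "since_2016": 28238}},
            {"h_index": {"all": 110, "since_2016": 63}},
            {"i10_index": {"all": 966, "since_2016": 448}},
        ]
    }
}

print(citation_stats(sample))
```

Wrapping the result in a list, as in pd.DataFrame([citation_stats(results)]), would produce one row per author, which makes it straightforward to stack several profiles.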

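Comments:

Thanks, the missing closing parenthesis was actually a typo on my side. However, when I tried to run your solution in Google Colab, I got this error: NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

That's interesting. I'm not sure what is causing the problem. I ran the same code I showed you, and it works in both PyCharm and Jupyter Notebook. Possibly the environment is not fully set up, but that is only a guess. I have added another solution to the answer above that uses the requests-html library, which works fine for me.

Unfortunately, even the other solution fails, with this error: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead. I have shared my Google Colab link so that you can better understand what is causing the problem.

Have you tried running the code in a local IDE (PyCharm, VSCode, etc.)? I think Google Colab already has its own event loop running. Try adding these lines before the script; they might help:

    import asyncio
    if asyncio.get_event_loop().is_running():
        import nest_asyncio
        nest_asyncio.apply()

You will probably need to install nest_asyncio before running the script with the lines above. Let me know if that helps.

I tried running it in VSCode as you suggested, but the AsyncHTMLSession error still persists. A screenshot was attached via a link.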
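The two ideas raised in this thread can be sketched as follows: checking for an already-running event loop before applying nest_asyncio, and switching to AsyncHTMLSession as the error message itself recommends. This is a tentative sketch rather than a confirmed fix, since the thread reports the error persisting. It uses asyncio.get_running_loop() in place of the get_event_loop().is_running() check from the comment, and all three function names are hypothetical.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (the usual case for a plain script).
        return False
    return True


def allow_nested_event_loops() -> bool:
    """Patch asyncio with nest_asyncio if a loop is already running.

    Returns True if the patch was applied, False if it was not needed.
    """
    if not event_loop_is_running():
        return False
    import nest_asyncio  # third-party; assumed installed via pip

    nest_asyncio.apply()
    return True


async def fetch_rendered_html(url: str) -> str:
    """Render a page with AsyncHTMLSession and return the resulting markup."""
    from requests_html import AsyncHTMLSession  # third-party; imported lazily

    session = AsyncHTMLSession()
    try:
        response = await session.get(url)
        # arender() is the awaitable counterpart of render().
        await response.html.arender()
        return response.html.html
    finally:
        await session.close()
```

In a notebook such as Google Colab, where a loop is already running, the coroutine can be awaited directly at the top level of a cell (html = await fetch_rendered_html(url)); in a plain script it would be run with asyncio.run(fetch_rendered_html(url)).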

Code for the requests-html solution:

from requests_html import HTMLSession
import pandas as pd

# Fetch the profile page and render its JavaScript
# (render() downloads Chromium the first time it is called)
session = HTMLSession()
url = 'https://scholar.google.com/citations?user=kukA0LcAAAAJ'
r = session.get(url)
r.html.render()

# Author name from the profile header
name = r.html.find('#gsc_prf_in', first=True).text

# Row 1 of the stats table: citations (all time, since 2016)
citations_all = r.html.find('.gsc_rsb_std', first=True).text
citations_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(1) > td:nth-child(3)', first=True).text

# Row 2: h-index (all time, since 2016)
h_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(2)', first=True).text
h_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(2) > td:nth-child(3)', first=True).text

# Row 3: i10-index (all time, since 2016)
i10_index = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(2)', first=True).text
i10_index_2016 = r.html.find('#gsc_rsb_st > tbody > tr:nth-child(3) > td:nth-child(3)', first=True).text

# One-row DataFrame; every value is wrapped in a list so all columns have length 1
data = {
    "Name": [name],
    "Citations": [citations_all],
    "Citations 2016": [citations_2016],
    "h-index": [h_index],
    "h-index 2016": [h_index_2016],
    "i10-index": [i10_index],
    "i10-index 2016": [i10_index_2016],
}

df = pd.DataFrame(data)

print(df)
