Python BeautifulSoup can't reach child elements


I wrote the code below to try to scrape a Google Scholar page:

import requests as req
from bs4 import BeautifulSoup as soup

url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'

session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 =  html2bs.find('div', {'id':"gs_cit1"})

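However, gs_citd gives me only a single line and does not reach any of the levels beneath it. In addition, gs_cit1 returns None, as the screenshot shows.

I want to reach the highlighted class so that I can grab the BibTeX citation.

Can you help?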

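OK, I figured it out. I used the selenium module for Python, which creates a virtual browser, if you will, that lets you do things like click links and read the resulting HTML. While working on this I ran into another problem: the page has to finish loading, otherwise the popup div only returns the text "Loading...". I therefore used Python's time module to call time.sleep(2), a 2-second pause that gives the content time to load. After that I simply parsed the resulting HTML with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href out of the anchor, and passed it to a request made with the "requests" Python module. Finally, I wrote the decoded response to a local file, scholar.bib.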

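I installed chromedriver and selenium on my Mac using the following instructions:

Then I signed the python file, to stop the firewall issues, using the following instructions:

Here is the code I used to generate the output file "scholar.bib":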
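The steps described above can be sketched roughly as follows. This is not the original listing, only a minimal sketch under stated assumptions: the names extract_bibtex_href, fetch_rendered_html and save_bibtex are made up for illustration, the selector a.gs_or_cit for the "Cite" link is a guess that should be checked in the browser inspector, and the BibTeX link is assumed to be the gs_citi anchor whose text reads "BibTeX". Selenium and requests are imported inside the functions that need them, so the parsing helper can be tried on its own.

```python
import time
from urllib.parse import quote_plus, urljoin

from bs4 import BeautifulSoup

SCHOLAR_URL = "https://scholar.google.com/scholar"
# Assumption: selector for a result's "Cite" link. Google Scholar's markup is
# not a stable API, so verify this in the browser inspector before relying on it.
CITE_LINK_SELECTOR = "a.gs_or_cit"


def extract_bibtex_href(html, base_url=SCHOLAR_URL):
    """Return the absolute URL of the BibTeX link in rendered HTML, or None."""
    page = BeautifulSoup(html, "html.parser")
    # The popup is assumed to hold several export links sharing the class
    # "gs_citi"; pick the one labelled BibTeX.
    for anchor in page.find_all("a", class_="gs_citi", href=True):
        if anchor.get_text(strip=True).lower() == "bibtex":
            return urljoin(base_url, anchor["href"])
    return None


def fetch_rendered_html(query, wait_seconds=2):
    """Open the results page in Chrome, click "Cite", return the rendered HTML."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs Chrome and a matching chromedriver
    try:
        driver.get(f"{SCHOLAR_URL}?hl=en&q={quote_plus(query)}")
        # Clicks the first matching "Cite" link on the page.
        driver.find_element(By.CSS_SELECTOR, CITE_LINK_SELECTOR).click()
        # Fixed pause so the popup can replace its "Loading..." placeholder.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()


def save_bibtex(query, path="scholar.bib"):
    """Fetch the BibTeX entry for the first result and write it to a file."""
    import requests

    href = extract_bibtex_href(fetch_rendered_html(query))
    if href is None:
        raise RuntimeError("no BibTeX link found; the popup may not have loaded")
    response = requests.get(href, timeout=30)
    response.raise_for_status()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(response.text)
```

Calling save_bibtex("Sustainability and the measurement of wealth: further reflections") would then run the whole sequence. Whether the final requests call succeeds without the browser's cookies is not guaranteed, and Google may interrupt any step with a verification page.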

Hope this helps anyone looking for a solution.

The scholar.bib file:

@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
  journal={Environment and Development Economics},
  volume={18},
  number={4},
  pages={504--516},
  year={2013},
  publisher={Cambridge University Press}
}

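Unfortunately, the "Cite" popup you want to scrape is fetched by a JavaScript event in the underlying web page. Since BeautifulSoup is a parser and not an interactive web-browsing client, you may want to consider using Selenium, PhantomJS, or another tool to solve this.

But when I tried to scrape several items, Google freaked out.

@downshift you should add your comment as an answer.

Thanks, @Kyle, this is a very thorough solution... I would just like to clarify a few things. First, why not solve it with selenium all the way to the end? I was able to simulate all the clicks with selenium until I had grabbed the citation. The catch with selenium is that when I did this for several papers, Google recognized it as an automated process and started asking for verification, which of course stops the process. Do you think your solution gets around this? Another point is that selenium has an implicitly_wait() function, which we can use instead of time.sleep().

I wasn't aware that implicitly_wait() was available from selenium. I just thought we only needed the bot to automate what was necessary to get the right source code, but I'm sure you could easily do everything with the extended selenium library.

I was able to do this with selenium, @kyle, but my concern is the verification issue. Also, processing a whole series of papers with the selenium approach would, I think, be less efficient in both memory and time.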
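On the waiting point raised in the comments: a fixed time.sleep(2) either wastes time or is too short on a slow connection. One alternative is to poll until the popup's links actually appear. The sketch below is an illustration only. The helper name wait_for_cite_links is hypothetical, it relies on nothing but the driver's page_source attribute, and it assumes the loaded popup serializes its links as class="gs_citi". Selenium also ships its own explicit wait (WebDriverWait with expected_conditions) for the same purpose, and implicitly_wait() applies only to Selenium's own element lookups, so it helps only if the anchor is located through find_element rather than by parsing page_source.

```python
import time


def wait_for_cite_links(driver, timeout=10.0, poll_interval=0.25):
    """Poll driver.page_source until a gs_citi link appears; return the HTML.

    Raises TimeoutError if the marker has not shown up within `timeout` seconds.
    """
    # Assumption: the loaded popup contains anchors serialized with this exact
    # attribute text; while loading it only shows a "Loading..." placeholder.
    marker = 'class="gs_citi"'
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("gs_citi links did not appear within the timeout")
        time.sleep(poll_interval)
```

The returned HTML can be handed straight to BeautifulSoup, so the rest of the parsing stays unchanged. This does nothing about Google's verification prompts, which are a separate problem.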
