Python: How to loop through hidden divs and scrape text?

python selenium web-scraping

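I am trying to scrape a website that hides its text inside expandable divs. I can only scrape the text from the first expandable div, even though I am able to click all of the divs. How do I scrape the text from all of them?

HTML of a collapsed item: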

<li class="views-row views-row-1 pub1 default-on clk" tabindex="150">  
          <div class="teaser Speeches col-xs-12 col-sm-12 col-md-12 col-lg-12 crop2" data-nid="50849" data-tid="6971" aria-hidden="false">
  <div class="thumb" style="padding-top: 0px; padding-bottom: 0px;">
  <img class="img-responsive" src="/sites/pm/files/styles/news_listing_square/public/default_news/20180501_default_news2.jpg?itok=a1pfZTOA" alt="" title=""></div>
  <div class="news-teaser">
    <div class="title">TITLE</div>
    <div class="body">TEASER TEXT</div>
    <div class="category">Speeches<br>PLACE <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2019-06-10T18:15:00-04:00">June 10, 2019</span></div>
  </div>
</div>
<div class="sticky0"></div>
<div class="full-article" aria-hidden="true"></div>  
</li>
<li class="views-row views-row-2 pub1 default-on clk" tabindex="150"> </li>
<li class="views-row views-row-3 pub1 default-on clk" tabindex="150"> </li>

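Currently, I can scrape the entire speech from the first article. The driver then clicks the second speech in the next expandable div and outputs a large block of whitespace, and the same thing happens (more whitespace) for the speeches that follow.

Any help would be greatly appreciated.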

You need to scope the search to the current div rather than to the whole document. Call the find* methods on the current element (article instead of browser).
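The effect of scoping can be illustrated without a browser. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in for Selenium, and the markup is a hypothetical, simplified version of the listing in the question. Searching from the document root returns the first match on every iteration, which is one way to end up with only the first article's text; searching from the current element returns that element's own content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing markup in the question.
LISTING = """
<ul>
  <li class="views-row-1"><div class="title">First speech</div></li>
  <li class="views-row-2"><div class="title">Second speech</div></li>
  <li class="views-row-3"><div class="title">Third speech</div></li>
</ul>
""".strip()

root = ET.fromstring(LISTING)


def title_from_document(_article):
    # Ignores the current element and searches the whole document,
    # so it returns the first matching title every time.
    return root.find(".//div[@class='title']").text


def title_from_article(article):
    # Searches only the descendants of the current element.
    return article.find(".//div[@class='title']").text


articles = root.findall("li")
unscoped = [title_from_document(a) for a in articles]
scoped = [title_from_article(a) for a in articles]

print(unscoped)
print(scoped)
```

In Selenium the same idea applies to the element returned for each list item. One detail to keep in mind there: an XPath that begins with // is evaluated from the document root even when it is called on an element, so a scoped expression should begin with .// instead.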

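The speech details are loaded with AJAX requests. This means you do not even need selenium for this; requests is enough, and it speeds things up considerably: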

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
}

def make_soup(url: str) -> BeautifulSoup:
    # Download a page and parse it; raise on HTTP errors.
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')

def fetch_speech_details(speech_id: str) -> str:
    # The f prefix makes this an f-string, so {speech_id} is filled in here.
    url = f'https://pm.gc.ca/eng/views/ajax?view_name=news_article&view_display_id=block&view_args={speech_id}'
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    data = res.json()
    html = data[1]['data']
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.select_one('.views-field-body')
    return str(body)

def scrape_speeches(soup: BeautifulSoup) -> list:
    speeches = []
    for teaser in soup.select('.teaser'):
        title = teaser.select_one('.title').text.strip()
        speech_id = teaser['data-nid']
        speech_html = fetch_speech_details(speech_id)
        s = {
            'title': title,
            'details': speech_html
        }
        speeches.append(s)
    return speeches

if __name__ == "__main__":
    url = 'https://pm.gc.ca/eng/news/speeches'
    soup = make_soup(url)
    speeches = scrape_speeches(soup)
    from pprint import pprint
    pprint(speeches)

Output:

[
    {'title': 'PM remarks for Lunar Gateway', 'details': '<div class="views-field views-field-body"> <p>CHECK AGAINST DELIVERY</p><p>Hello everyone!</p><p>I’m delighted to be here at the Canadian Space Agency to share some great news with Canadians.</p><p>I’d like to start by thanking the President of the Agency, Sylvain Laporte ... },
    {...},
    ....
]
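The answer reads the HTML fragment with data[1]['data'], which depends on the fragment always sitting at index 1. A slightly more defensive variant is sketched below, under the assumption that the endpoint returns a JSON list of objects and that the fragment is the first non-empty string stored under a 'data' key. The sample payload, its "command" values, and the extract_html helper are hypothetical and only stand in for the real response, which may be shaped differently.

```python
import json
from typing import Optional

# Hypothetical sample shaped like the response the answer indexes with
# data[1]['data']; the real payload from the site may differ.
SAMPLE_RESPONSE = json.dumps([
    {"command": "settings", "settings": {}},
    {"command": "insert", "data": "<div class=\"views-field-body\"><p>Hello</p></div>"},
])


def extract_html(payload: str) -> Optional[str]:
    """Return the first non-empty 'data' string found in a JSON list of objects."""
    commands = json.loads(payload)
    for command in commands:
        html = command.get("data") if isinstance(command, dict) else None
        if isinstance(html, str) and html.strip():
            return html
    return None


print(extract_html(SAMPLE_RESPONSE))
```

Returning None instead of raising lets the caller decide whether a missing fragment should skip the speech or stop the run.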

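Thanks for the reply. Just a few corrections: 1. In the fetch_speech_details function there is a random "f" in front of the url. 2. After declaring the url, you have to do url = url.format(speech_id=speech_id). 3. You forgot to return speeches in the scrape_speeches function. That aside, thank you very much. I also have a few follow-up questions: 1. How did you know this was AJAX, and how did you find the AJAX string? 2. How do I strip all the tags from "details"? 3. Should I still use selenium to handle the infinite scroll?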
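Two points raised in this comment can be checked with a short sketch. First, the f prefix marks an f-string (Python 3.6 and later), so it produces the same URL as an explicit .format call. Second, one possible way to strip the tags from "details" is a small html.parser subclass from the standard library. The TextExtractor class and strip_tags function are hypothetical names introduced here, and the sample string is a shortened copy of the output shown above.

```python
from html.parser import HTMLParser

# 1. f-string versus str.format: both build the same URL.
speech_id = "50849"  # data-nid value taken from the HTML in the question
base = ("https://pm.gc.ca/eng/views/ajax?view_name=news_article"
        "&view_display_id=block&view_args=")
url_fstring = f"{base}{speech_id}"
url_format = (base + "{speech_id}").format(speech_id=speech_id)


# 2. Tag stripping with the standard library.
class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = ('<div class="views-field views-field-body"> '
          '<p>CHECK AGAINST DELIVERY</p><p>Hello everyone!</p></div>')

print(url_fstring == url_format)
print(strip_tags(sample))
```

Since the answer already parses the fragment with BeautifulSoup, calling get_text(separator='\n', strip=True) on the selected body element instead of str(body) should give a similar result without the extra class.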

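With Selenium, the lookups scoped to the current element look like this. Note the leading dot in each XPath: an expression that starts with // is evaluated from the document root even when it is called on an element.

title = article.find_element_by_xpath(".//h1[@class = 'field-content']")
speech_div = article.find_elements_by_xpath(".//span[@lang = 'EN-CA']")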
