For loop when web scraping with Selenium in Python


python selenium web-scraping


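I am trying to scrape information from the following website:

I want to scrape the description of each family office, so the page I need to scrape is the company URL "+ insert company name" (the base URL appears in the code below).

So I wrote the following code to test the program on a single page: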

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome('insert_path_here/chromedriver')
driver.get("https://network.axial.net/company/ansaco-llp")
page_source = driver.page_source
soup2 = soup(page_source,"html.parser")
soup2.findAll('axl-teaser-description')[0].text

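This works for a single page, as long as the description does not have a "show full description" dropdown button. I will save that for another question.

I wrote the following loop: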

#Note: Lst2 has all the names for the companies. I made sure they match the webpage
lst3=[]
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/"+key.lower())
    page_source = driver.page_source


    for handle in driver.window_handles:
         driver.switch_to.window(handle)
    word_soup = soup(page_source,"html.parser")



    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)

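When I run the loop, every value comes back as 'null', even for companies whose description has no "click for full description" button.

I edited the loop to print word_soup instead. The page it prints is different from the one I get when I run the code without the loop, and it does not contain the description text.

I don't understand why the loop would cause this, but apparently it does. Does anyone know how to fix it?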

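I see that the page uses JavaScript to generate the text, which means it does not show up in the page source. That is odd but normal. I don't quite understand why you iterate over and switch to every Selenium window you have open, but the description definitely cannot be found in the page source you pass to BeautifulSoup.

Honestly, I would personally look for a better website if you can; otherwise you will have to try it with Selenium. Selenium is very inefficient, and it's terrible.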
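To illustrate the point about JavaScript-generated text, here is a minimal sketch that runs without a browser. It uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies, and both HTML strings are made-up stand-ins, not the real site's markup. The helper names (`TagTextCollector`, `find_descriptions`) are hypothetical. The idea is only that a parser cannot find an element that the script has not inserted yet.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0   # greater than 0 while we are inside the target tag
        self.texts = []  # one string per top-level occurrence of the tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def find_descriptions(html, tag="axl-teaser-description"):
    """Return the stripped text of each <tag> element found in html."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]


# Hypothetical markup: what the server might send before any script has run.
INITIAL_HTML = (
    "<html><body><app-root></app-root>"
    "<script src='main.js'></script></body></html>"
)
# Hypothetical markup: the same page after the script has filled it in.
RENDERED_HTML = (
    "<html><body><app-root>"
    "<axl-teaser-description> Example description. </axl-teaser-description>"
    "</app-root></body></html>"
)

print(find_descriptions(INITIAL_HTML))   # []
print(find_descriptions(RENDERED_HTML))  # ['Example description.']
```

An empty result like the first one is what the 'null' branch in the question's loop reacts to.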

Found the solution. Pause the program for 3 seconds after driver.get:

import time

# driver, soup and lst2 are defined as in the question above
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    # Give the browser time to run the page's JavaScript before reading the source
    time.sleep(3)
    word_soup = soup(driver.page_source, "html.parser")

    # find_all returns an empty list when the element is missing
    matches = word_soup.find_all('axl-teaser-description')
    lst3.append(matches[0].text if matches else 'null')
print(lst3)
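A fixed 3-second pause can be too short on a slow page and wastes time on a fast one. One possible alternative is to poll the page source until the element shows up. The sketch below uses only the standard library and relies on nothing but `driver.get` and `driver.page_source`, so it can be exercised with a stand-in object. The function name `wait_for_source`, the default timeout and the marker string are assumptions, not part of the original solution. Selenium also ships `WebDriverWait` and `expected_conditions` for this purpose, which may be the better choice in real code.

```python
import time


def wait_for_source(driver, url, marker, timeout=10.0, poll_interval=0.5):
    """Load url and return page_source once marker appears in it.

    Returns None if the marker has not appeared after `timeout` seconds.
    `driver` only needs a get(url) method and a page_source attribute,
    which a Selenium WebDriver provides.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Usage inside the loop above (assumption: the opening tag appears
# literally in the rendered source):
#     page_source = wait_for_source(
#         driver,
#         "https://network.axial.net/company/" + key.lower(),
#         "<axl-teaser-description",
#     )
#     if page_source is None:
#         lst3.append('null')
```

Note that the presence of the opening tag does not guarantee that its text has been filled in yet, so the parsed result is still worth checking for an empty string.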

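Your first ansaco-llp example does not work for me. It cannot find the axl-teaser-description element. If you print page_source and inspect it, it does not contain that element.

@Sri Not sure why it does not work for you, but I found a solution, which I will post in the next comment. The window-handle loop is not needed; I changed that in the s…

Yes, I forgot that the browser needs time to load the page, whereas with requests that is built in and practically instant.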