How to extract the price of a security as text from a website using Python, Selenium, and BeautifulSoup


I am simply trying to get the security price shown on the page. I run the following code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path=r'C:\Program_Files_EllieTheGoodDog\Geckodriver\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

When I use "Inspect Element" on the price in the Firefox window opened by Selenium, I clearly see:

<span data-ng-if="!data.isLayer" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" class="ng-scope ng-binding arrange">$42.91</span >

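I am completely stumped. If anyone could point me in the right direction, I would really appreciate it. I have a feeling I am missing something entirely, probably several things…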

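There is nothing wrong with the way you use the data-* attributes and their values to select the span. In fact, passing them in an attrs dictionary is the approach the BeautifulSoup documentation describes for data-* attributes. Four span tags match all of the attributes, and find_all returns every one of them. The second one corresponds to the price.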
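
Relying on the index of the second match works, but it breaks if the page adds or reorders spans. As a rough sketch, one alternative is to pick the match whose text looks like a price. The HTML below is a trimmed stand-in based on the output shown in this thread, not the real page, and `find_price` and `PRICE_RE` are hypothetical names introduced only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for the rendered page (assumption: the real markup
# carries more attributes and surrounding structure than shown here).
HTML = """
<table><tr>
  <td><span data-ng-bind-html="data.value" class="ng-scope ng-binding arrange">Unit price as of 02/15/2019</span></td>
  <td><span data-ng-bind-html="data.value" class="ng-scope ng-binding arrange">$42.91</span></td>
</tr><tr>
  <td><span data-ng-bind-html="data.value" class="ng-scope ng-binding arrange">Change</span></td>
  <td><span data-ng-bind-html="data.value" class="ng-scope ng-binding arrange"><span class="number-positive">$0.47</span> <span class="number-positive">1.11%</span></span></td>
</tr></table>
"""

# A dollar sign, digits (optionally with commas), and exactly two decimals.
PRICE_RE = re.compile(r"\$\d[\d,]*\.\d{2}")


def find_price(html):
    """Return the text of the first matching span that is just a price."""
    soup = BeautifulSoup(html, "html.parser")
    # data-* names are not valid Python keyword arguments,
    # so they are passed through the attrs dictionary.
    spans = soup.find_all("span", attrs={"data-ng-bind-html": "data.value"})
    for span in spans:
        text = span.get_text(strip=True)
        # fullmatch rejects the "Change" cell, which holds two values.
        if PRICE_RE.fullmatch(text):
            return text
    return None


print(find_price(HTML))
```

The function returns None when no span contains a bare price, which is what happens if the page source is captured before the span has loaded.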

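What you are missing is that the span takes some time to load, and the page source is returned before that happens. You can explicitly wait for the span and then get the page source. Here I use an XPath to wait for the element. You can get the XPath from the inspect tool: right-click the element -> Copy -> Copy XPath.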

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[3]/div[1]/div/div/table/tbody/tr[1]/td[2]/div/span[1]')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': '{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}', 'class': 'ng-scope ng-binding arrange'})
print(myspan)
print(myspan[1].text)

Output:

[<span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Unit price as of 02/15/2019</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">$42.91</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Change</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer"><span class="number-positive">$0.47</span> <span class="number-positive">1.11%</span></span>]
$42.91
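
Conceptually, the explicit wait used above polls a condition until it holds or a timeout expires. The following is a simplified sketch of that idea using only the standard library. It is not Selenium's actual implementation, and `wait_until`, `FakePage`, and their parameters are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly; return its first truthy result.

    Raises TimeoutError if no truthy result arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose price only appears after a few checks."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def find_price(self):
        self.checks += 1
        return "$42.91" if self.checks >= self.ready_after else None


page = FakePage(ready_after=3)
price = wait_until(page.find_price, timeout=2.0, poll_interval=0.01)
print(price)
```

Reading the page source immediately corresponds to calling `find_price` once and getting nothing back; waiting first is what lets the later parsing step see the span.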
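Selenium alone is enough to extract the desired text. You need to induce a WebDriverWait for visibility_of_element_located(), and you can use the following solution: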

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
    driver.get('https://investor.vanguard.com/529-plan/profile/4514')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//tr[@class='ng-scope']//td[@class='ng-scope right']//span[@class='ng-scope ng-binding arrange' and @data-ng-bind-html]"))).get_attribute("innerHTML"))
    
  • Console output:

    $42.91
    

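data-* attributes can be accessed through dataset.

Sorry, but I don't understand what that means. I'm sure it's just another sign that I don't know what I'm doing! But thanks.

Not really, it's just that attributes whose names start with data- can be accessed through dataset[]. For example, the value can be read with document.querySelector('input#ease').dataset['value'], which gives $42.91.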
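
The dataset property mentioned in these comments belongs to JavaScript's DOM API. On the Python side, BeautifulSoup exposes data-* attributes like any other attribute, through dictionary-style access on the tag. A minimal sketch follows, assuming a hypothetical `input#ease` element with a `data-value` attribute, since the thread does not show the actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical element: the markup behind `input#ease` is not shown
# in the thread, so this one-line document is an assumption.
HTML = '<input id="ease" data-value="$42.91">'

soup = BeautifulSoup(HTML, "html.parser")
tag = soup.find("input", id="ease")

# Tag attributes behave like a dictionary, data-* attributes included.
value = tag["data-value"]
print(value)

# .get() returns None for a missing attribute instead of raising KeyError.
print(tag.get("data-missing"))
```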