Python: Is there a way to get specific text from a CSS selector?

python selenium web-scraping

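When I inspect an element, I want to scrape only the #text part under a CSS selector. Instead, I seem to be scraping every number under the selector, not just the text node.

The link I am scraping is

I want to get the price under "pick your phone price", but without the "$" and without the "99" cents at the end of the string.

At the moment I only know how to scrape the whole string: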

    driver.get(link)
    time.sleep(3)
    print('---------------------------  begining ------------------')

    planTypeUpfrontCostListRaw = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#phonePricesList .ultra')))

    for element in planTypeUpfrontCostListRaw:
        upfrontCost = element.text
        print(upfrontCost)

    print('---------------------------  END  ------------------------')

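Solution 1
Use element.get_attribute('innerHTML') instead of element.text. This returns the element's HTML, markup and text together.

For example, for one of the price elements it returns:

    "<sup>$</sup>199<sup>99</sup>"

Here is the code to do this: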

    import re

    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    link = "https://www.virginmobile.ca/en/phones/phone-details.html#!/gs9/Grey/64/TR20"
    driver = Chrome()
    wait = WebDriverWait(driver, 15)  # wait up to 15 seconds for the prices to render
    driver.get(link)
    print('---------------------------  begining ------------------')

    # Wait until the price elements are present in the DOM
    planTypeUpfrontCostListRaw = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.price.ultra.ng-binding.ng-scope')))

    for element in planTypeUpfrontCostListRaw:
        # innerHTML looks like '<sup>$</sup>199<sup>99</sup>'
        upfrontCost = element.get_attribute('innerHTML')
        # The first run of digits is the dollar amount (raw string avoids an invalid-escape warning)
        upfrontCost = re.search(r'\d+', upfrontCost).group(0)
        print(upfrontCost)
    print('---------------------------  END  ------------------------')

Output:

    ---------------------------  begining ------------------
    0
    0
    199
    349
    739
    1019
    ---------------------------  END  ------------------------
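The regex step can be tried without a browser. The sketch below runs it on literal innerHTML strings shaped like the example above; extract_dollars is a hypothetical helper name, not part of the original answer, and the last two sample strings are made up to show the edge cases. It also guards against re.search returning None, which the loop above would turn into an AttributeError.

```python
import re


def extract_dollars(inner_html):
    """Return the first run of digits in an innerHTML string, or None."""
    # Hypothetical helper for illustration only.
    match = re.search(r'\d+', inner_html)
    # re.search returns None when the string contains no digits at all
    return match.group(0) if match else None


# Sample strings are assumptions modelled on '<sup>$</sup>199<sup>99</sup>'
samples = ['<sup>$</sup>199<sup>99</sup>', '<sup>$</sup>0', '<sup>$</sup>']
for html in samples:
    print(extract_dollars(html))
```

Because the match is the first digit run, this only works while the dollar amount comes before the cents in the markup.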


Solution 2
You can still use element.text: remove the unwanted $ with strip, then drop the last two digits (the cents).

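    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Same page as in Solution 1
    link = "https://www.virginmobile.ca/en/phones/phone-details.html#!/gs9/Grey/64/TR20"
    driver = Chrome()
    wait = WebDriverWait(driver, 15)
    driver.get(link)
    print('---------------------------  begining ------------------')

    planTypeUpfrontCostListRaw = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.price.ultra.ng-binding.ng-scope')))

    for element in planTypeUpfrontCostListRaw:
        # element.text is e.g. '$19999'; strip('$') leaves '19999'
        upfrontCost = element.text.strip('$')
        # A price of 0 has no cents, so only cut the last two digits otherwise
        if upfrontCost != '0':
            upfrontCost = upfrontCost[:-2]
        print(upfrontCost)
    print('---------------------------  END  ------------------------')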
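The string handling in this solution can be checked on its own. The sketch below assumes the visible text is either '$0' or a '$' followed by the dollars and exactly two cents digits, as on the page in question; dollars_from_text is a hypothetical helper name and the sample strings are made up to match that shape.

```python
def dollars_from_text(text):
    """Turn visible text such as '$19999' into the dollar part, '199'."""
    # Hypothetical helper for illustration only.
    # str.strip('$') removes '$' from both ends; lstrip only touches the left edge.
    digits = text.strip().lstrip('$')
    # Assumption: every non-zero price ends with two cents digits
    if digits != '0':
        digits = digits[:-2]
    return digits


for sample in ['$0', '$19999', '$101999']:
    print(dollars_from_text(sample))
```

A price shown without cents, such as '$10', would be cut down to an empty string, so this approach depends on the page always rendering the cents.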

Alternatively, you can dump the page source into bs4 and use stripped_strings:

    from bs4 import BeautifulSoup as bs
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # The original passed r'C:\Users\User\Documents\chromedriver.exe' as the first
    # argument; recent Selenium releases no longer accept a driver path there.
    d = webdriver.Chrome()
    d.get('https://www.virginmobile.ca/en/phones/phone-details.html?province=ON&geoResult=failed#!/gs9/Grey/64/TR20')
    WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "planlevels .price")))
    soup = bs(d.page_source, 'lxml')  # requires the lxml package
    plans = soup.select('planlevels .price')

    for plan in plans:
        # stripped_strings yields e.g. '$', '199', '99'; index 1 is the dollar amount
        price = list(plan.stripped_strings)[1]
        print(price)

Uglier, IMO, would be to use split and no bs4:

    plans = WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "planlevels .price")))
    for plan in plans:
        # Keep what sits between the first '</sup>' and the next '<sup>'
        print(plan.get_attribute('innerHTML').split('</sup>')[1].split('<sup>')[0])
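The split version depends on the exact order of the sup tags. As a swapped-in alternative, not taken from any of the answers, the sketch below uses the standard library's html.parser to keep only the text that is not nested inside a child tag, which is the "#text" node the question asks about. DirectTextCollector and direct_text are hypothetical names, and the input is a literal string shaped like the innerHTML shown earlier.

```python
from html.parser import HTMLParser


class DirectTextCollector(HTMLParser):
    """Collect text that is not nested inside any tag of the fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open tags
        self.chunks = []  # text found at depth 0

    def handle_starttag(self, tag, attrs):
        # Limitation: void tags written without a slash (e.g. <br>) never
        # close, so they would throw this simple depth counter off.
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def direct_text(inner_html):
    parser = DirectTextCollector()
    parser.feed(inner_html)
    parser.close()  # flush any trailing text
    return ''.join(parser.chunks)


print(direct_text('<sup>$</sup>199<sup>99</sup>'))
```

With Selenium, the argument would be the string returned by get_attribute('innerHTML') for each price element.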
