Python: unable to point to a Selenium element

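I am writing a web scraper that goes through a list of links from a CSV file and pulls details from each link. However, I am having trouble pointing to the element that contains the email address I am trying to get. If you look at https://reality.idnes.cz/rk/detail/m-m-reality-holding-a-s/5a85b582a26e3a321d4f2700/ you can see a company name, an address, a phone number and an email. The email is the element I have a problem with. If you inspect the page source, you will quickly notice that the phone number and the email share the same class, "item-icon". If you look at my code, you will see that I tried to use nth-child to reference the actual class, but for some reason that does not work either. The result is neither printed nor written to the CSV file, so the element is not being found. This is the code I am having trouble with: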

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.chrome.options import Options
import time
import csv

with open('links.csv') as read:
    reader = csv.reader(read)
    link_list = list(reader)
    with open('ScrapedContent.csv', 'w+', newline='') as write:
        writer = csv.writer(write)
        options = Options()
        options.add_argument('--no-sandbox')
        path = "/home/kali/Desktop/SRealityContentScraper/chromedriver"
        driver = webdriver.Chrome(path)
        wait = WebDriverWait(driver, 10)
        for link in link_list:
            driver.get(', '.join(link))
            time.sleep(2)
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.b-annot__title.mb-5")))
            title = driver.find_element_by_css_selector("h1.b-annot__title.mb-5")
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
            offers = driver.find_element_by_css_selector("span.btn__text")
            information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "p.font-sm")))
            addresses = driver.find_element_by_css_selector("p.font-sm")
            try:
                information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "a.item-icon.measuring-data-layer")))
                phone_number = driver.find_element_by_css_selector("a.item-icon.measuring-data-layer")
            except Exception:
                pass
            try:
                information_list = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR,"span.items:nth-of-type(2) span.items__item a.item-icon")))
                email = driver.find_element_by_css_selector("span.items:nth-of-type(2) span.items__item a.item-icon")
            except Exception:
                pass
            try:
                phone_number = phone_number.text
            except Exception:
                phone_number = " "
                pass
            try:
                email = email.text
            except Exception:
                email = " "
                pass
            print(title.text, " ", offers.text, " ", addresses.text, " ", phone_number, " ", email)
            writer.writerow([title.text, offers.text, addresses.text, phone_number, email])

        driver.quit()
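The reason for the try blocks in the code is that some pages in the link list are missing an email or a phone number. If that happens, the field is filled with a " " blank string instead. However, even when the information is present on the page, it is not printed, which leads me to believe the element is not being located correctly. I removed the try blocks to test the output and, sure enough, Selenium confirmed that the element could not be found. Without the nth-child part, the scraper picks up the phone number twice instead of a phone number and an email. As far as I understand, this is because Selenium always takes the first element on the page that matches the CSS selector, which is the phone number.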


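My question is: how do I point to the element correctly so that the email is scraped properly? Thanks for any help! I am starting to get desperate...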

I will try to help. My solution is to use XPath as the selector:

How it works -> "//a[./span[contains(@class, 'icon icon--email')]]"

//a -> any a element

[./span] -> that has a child span inside it

[contains(@class, 'icon icon--email')] -> whose class contains that string

    xpath_phone = "//a[./span[contains(@class, 'icon icon--phone')]]"

    xpath_email = "//a[./span[contains(@class, 'icon icon--email')]]"

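    # example for email
    try:
        information_list = wait.until(ec.presence_of_element_located((By.XPATH, xpath_email)))
        # Selenium 3 style, as in the question; Selenium 4 spells this
        # driver.find_element(By.XPATH, xpath_email)
        email = driver.find_element_by_xpath(xpath_email)
    except Exception:
        pass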
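
To make the predicate easier to follow without a browser, here is a minimal sketch that mimics it in plain Python on a small HTML fragment. The fragment, the example.com address and the helper names `links_with_icon` and `link_text` are assumptions made up for illustration; only the class names come from the XPath above. The standard library's ElementTree does not support `contains()`, so the check is written out by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the class names used above.
SAMPLE = """
<div>
  <span class="items__item">
    <a class="item-icon" href="tel:+420123456789"><span class="icon icon--phone"></span>+420 123 456 789</a>
  </span>
  <span class="items__item">
    <a class="item-icon" href="mailto:info@example.com"><span class="icon icon--email"></span>info@example.com</a>
  </span>
</div>
"""


def links_with_icon(root, icon_class):
    """Plain-Python equivalent of //a[./span[contains(@class, icon_class)]]."""
    return [
        a for a in root.iter("a")  # //a: every <a> in the tree
        # ./span: direct child spans; contains(): substring test on the class
        if any(icon_class in span.get("class", "") for span in a.findall("span"))
    ]


def link_text(a):
    """Concatenate all text inside the link, like Selenium's .text would show."""
    return "".join(a.itertext()).strip()


root = ET.fromstring(SAMPLE)
email_links = links_with_icon(root, "icon icon--email")
phone_links = links_with_icon(root, "icon icon--phone")
email_text = link_text(email_links[0])
phone_text = link_text(phone_links[0])
print(phone_text, email_text)
```

Both links share the class `item-icon`, so the link itself cannot tell them apart; the child span's icon class can, which is why the predicate looks inside the link rather than at it.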

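To print the email address, i.e. agorniak@mmreality.cz, you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies: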

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.items__item>a[href*='@']"))).text)
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='items__item']/a[contains(@href, '@')]"))).text)
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
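
Since some pages in the link list have no email or phone number, the href-based locator can also be combined with `find_elements`, which returns an empty list instead of raising when nothing matches. The following is only a sketch under assumptions: `first_text` and `contact_row` are hypothetical helper names, the phone locator is the one from the question, and the page is assumed to have finished loading before the helpers are called (they do not wait).

```python
# By.CSS_SELECTOR is the string "css selector"; it is spelled out here so the
# sketch runs without importing selenium.
CSS_SELECTOR = "css selector"

# Locators taken from this thread (email from the answer, phone from the question).
EMAIL_LINK = "span.items__item>a[href*='@']"
PHONE_LINK = "a.item-icon.measuring-data-layer"


def first_text(driver, selector, default=""):
    """Return the text of the first element matching selector, or default.

    find_elements returns an empty list when nothing matches, so optional
    fields need no try/except. Hypothetical helper, not part of Selenium.
    """
    matches = driver.find_elements(CSS_SELECTOR, selector)
    return matches[0].text.strip() if matches else default


def contact_row(driver):
    """Collect [phone, email] for the page currently loaded in driver."""
    return [first_text(driver, PHONE_LINK), first_text(driver, EMAIL_LINK)]
```

In the loop from the question this could be used as `writer.writerow([title.text, offers.text, addresses.text] + contact_row(driver))`, which also avoids carrying a value over from the previous page when a field is missing.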
    


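Why not try using XPath for the email?

This solution works! I knew about XPath, but you seem to be using a more advanced kind of XPath. Special thanks for explaining how XPath works using my example!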