
Scraping text with Python Selenium: unable to locate an element that really exists

python selenium web-scraping


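I am trying to extract text from the following page source:

I am using Selenium and Python to scrape the text "Diese Termine stehen zu…".

What I have tried so far

Finding the element by XPath, using the absolute position: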

    availability = driver.find_element_by_xpath("//*[@id='booking-content']/div[2]/div[4]/div/div[2]/div/div/div/div[1]/div/div/span")

Using the class name:

    elements = driver.find_element_by_class_name("dl-text dl-text-body dl-text-regular dl-text-s dl-text-color-inherit")

Using a CSS selector, with the following keyword: .booking-message .dl-text

    availability = driver.find_element_by_css_selector('.booking-message .dl-text')

None of these work. I was sure the third attempt should work because, as the screenshot shows, Chrome finds the element with the same keyword. But still no luck. The error message is:

    Traceback (most recent call last):
      File "/Users/GunardiLin/Desktop/Codes/Tracker.py", line 18, in <module>
        availability = driver.find_element_by_css_selector('.booking-message .dl-text')
      File "/Users/GunardiLin/opt/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 598, in find_element_by_css_selector
        return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
      File "/Users/GunardiLin/opt/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
        'value': value})['value']
      File "/Users/GunardiLin/opt/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "/Users/GunardiLin/opt/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".booking-message .dl-text"}
      (Session info: chrome=90.0.4430.212)

@itronic1990 22.05.2021 07:45: I have checked your suggestion:

    driver.find_element_by_xpath(".//div[contains(@class,'booking-message')]/span").text

As shown above, Chrome can find the text with this filter. But when I run the code, the element is not found. My test code:

    import os
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = r"https://www.doctolib.de/gemeinschaftspraxis/muenchen/fuchs-hierl"

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(executable_path="/Applications/chromedriver", options=chrome_options)
    driver.get(url)

    element_text = driver.find_element_by_xpath(".//div[contains(@class,'booking-message')]/span").text
    print(element_text)
    driver.quit()

Error message:

    NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class,'booking-message')]/span"}
      (Session info: headless chrome=90.0.4430.212)

I don't understand why. Thanks for your suggestions.

You have used find_element both with XPath and by class name. Is that right? Try this:

    driver.find_element_by_xpath(".//div[contains(@class,'booking-message')]/span").text

Why bother with Selenium? Get the data directly from the source:

    import requests

    url = 'https://www.doctolib.de/availabilities.json'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    payload = {
        'start_date': '2021-05-21',
        'visit_motive_ids': '2820334',
        'agenda_ids': '466608',
        'insurance_sector': 'public',
        'practice_ids': '25230',
        'limit': '4'}

    # The page loads its availability data from this JSON endpoint
    jsonData = requests.get(url, headers=headers, params=payload).json()

Output:

    print(jsonData['message'])
    Diese Termine stehen zu einem späteren Zeitpunkt wieder für eine Online-Buchung zur Verfügung.
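To make the JSON response easier to work with, here is a minimal offline sketch of how it might be interpreted. It assumes the response uses the keys that appear in this thread (`message`, `next_slot`, and `availabilities` entries with `date` and `slots`). The function name `summarize_availability` and both sample dictionaries are hypothetical and only illustrate the shape; they are not real responses from the site.

```python
def summarize_availability(data):
    """Return a one-line summary of an availabilities-style response dict."""
    # Dates that carry at least one bookable slot
    open_dates = [day["date"]
                  for day in data.get("availabilities", [])
                  if day.get("slots")]
    if open_dates:
        return "Available dates: " + ", ".join(open_dates)
    # No open slots in this window, but the API may point to a later one
    if "next_slot" in data:
        return "Next slot: " + data["next_slot"]
    # Otherwise fall back to the message the site would display
    return data.get("message", "No availability information")


# Hypothetical sample payloads (assumed shapes, not captured from the site)
sample_closed = {"total": 0, "availabilities": [],
                 "message": "Diese Termine stehen zu einem späteren Zeitpunkt "
                            "wieder für eine Online-Buchung zur Verfügung."}
sample_open = {"total": 1,
               "availabilities": [{"date": "2021-05-21", "slots": []},
                                  {"date": "2021-05-22", "slots": ["09:00"]}]}

print(summarize_availability(sample_closed))
print(summarize_availability(sample_open))
```

In a live script, the dictionary returned by `requests.get(...).json()` would be passed in place of the samples.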
I'm not familiar with German, otherwise I could probably make this more efficient. Use the practice_ids to feed the request and get the data for each practice:

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime

    # Get location practice_ids
    url = 'https://www.doctolib.de/allgemeinmedizin/81667-muenchen'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

    practice_ids_list = []
    for page in range(1, 100):
        payload = {'page': page}
        response = requests.get(url, headers=headers, params=payload)
        if response.status_code == 404:
            break  # no more result pages
        else:
            print('Page: %s' % page)
            soup = BeautifulSoup(response.text, 'html.parser')
            divs = soup.find_all('div', {'class': 'dl-search-result'})
            for div in divs:
                # The practice id is the last part of the div's id attribute
                practice_id = div['id'].split('-')[-1]
                practice_ids_list.append(practice_id)

    today = datetime.today().strftime('%Y-%m-%d')
    url = 'https://www.doctolib.de/availabilities.json'
    for practice_id in practice_ids_list:
        payload = {
            'start_date': today,
            'visit_motive_ids': '2820334',
            'agenda_ids': '466606',
            'insurance_sector': 'public',
            'practice_ids': '%s' % practice_id,
            'limit': '15'}

        jsonData = requests.get(url, headers=headers, params=payload).json()
        if jsonData['total'] == 0 and 'next_slot' not in jsonData.keys():
            #print('\t', jsonData['message'],'\n')
            print(practice_id)
        else:
            # Get Clinic Details
            clinic_url = 'https://www.doctolib.de/search_results/%s.json' % practice_id
            clinic_jsonData = requests.get(clinic_url, headers=headers).json()
            clinic_name = clinic_jsonData['search_result']['name_with_title']
            address = clinic_jsonData['search_result']['address']
            city = clinic_jsonData['search_result']['city']
            zipcode = clinic_jsonData['search_result']['zipcode']
            print('%s\n%s %s %s' % (clinic_name, address, city, zipcode))

            # Re-query from the next free slot only when the API reports one;
            # the original updated unconditionally, which raises a KeyError
            # when slots exist but 'next_slot' is absent.
            if 'next_slot' in jsonData:
                payload.update({'start_date': jsonData['next_slot']})
                jsonData = requests.get(url, headers=headers, params=payload).json()

            print('\n\t', '*' * 50, '\nThe following dates are available:')
            for each_date in jsonData['availabilities']:
                if len(each_date['slots']) > 0:
                    print('\t\t', each_date['date'])

You are probably missing some wait/delay before calling driver.find_element_by_css_selector('.booking-message .dl-text'). Can you share a link to the web page?

@gunardilin What exactly are you trying to get? What is your expected result?

@gunardilin I opened that link. I don't see any element matching the .booking-message .dl-text locator. I do see the element located by .booking-message, but there is nothing inside it. By wait/delay I mean adding an expected condition that waits for something, such as the element becoming visible. But I still can't see this element, so I'm not sure that is relevant. The site may, however, show different data for different locations, so it may not show me what you see there.

I have included all my code and attempts in my original post. I tried your suggestion, but it still doesn't work. It confuses me why it doesn't work…

Can you try the XPath in the developer console and see whether it returns any element?

Hey, I have tried your suggestion. It still doesn't work. I updated my original post to answer your question. Basically, the developer console can find it with the filter, but the script cannot... Thanks for your further help :-)

Wow, that looks promising. Sorry, I'm a beginner. What is the reason for using plain requests? When are BeautifulSoup and Selenium the better choice? I hadn't thought of using requests directly. Thanks for your answer :-)

Why did you ask whether I'm only interested in this one location? Do you mean I could run the same script for other locations too? Thanks for the clarification.

@gunardilin. Yes, you can change the parameters here to look at different vaccines and different locations. You just need to work out those ids/codes, and then you can have the script go through all those locations to check for dates. I'll show you what I mean (I'll adjust the code above).

If the data can be fetched directly from an API, or is returned as JSON, use plain requests (there is no need to parse the data out of HTML). If you need to get the data from the HTML source, use requests to fetch the HTML and then BeautifulSoup to parse the data from it. Otherwise, use Selenium.

Go to the dev tools (Shift-Ctrl-I). Under the Network -> XHR tab you can see it (you may need to refresh the page). As for the parameters, it's just trial and error (i.e., with the dev tools open, click on things on the site that change the content and watch the XHR tab, noting what you clicked and what changed).

Honestly, I never did an online tutorial or course on web scraping. I just practiced and looked into different ways of doing it. I just learned things from tria
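The "wait/delay" suggested in the comments amounts to polling for the element until it appears or a timeout expires, instead of looking it up once right after the page loads. The sketch below shows that idea in a library-independent way. `wait_for` and `flaky_lookup` are hypothetical names introduced here, and `flaky_lookup` only simulates content that renders late. Selenium ships its own `WebDriverWait` and expected conditions for the same purpose, and the thread does not establish that waiting fixes this particular page.

```python
import time


def wait_for(lookup, timeout=10.0, poll=0.5):
    """Call lookup() repeatedly until it succeeds or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return lookup()
        except Exception as exc:
            # Give up once the deadline has passed; otherwise retry after a pause
            if time.monotonic() >= deadline:
                raise TimeoutError("lookup did not succeed within %.1f s" % timeout) from exc
            time.sleep(poll)


# Stand-in for an element that is rendered late: fails twice, then succeeds
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("not rendered yet")
    return "found"


print(wait_for(flaky_lookup, timeout=2.0, poll=0.01))
```

With the Selenium version used in the question, the lookup could be something like `lambda: driver.find_element_by_css_selector('.booking-message .dl-text')`, so that the `NoSuchElementException` is retried until the timeout.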