Python: how do I extract links from a site with pagination? (using Selenium)
I want to extract links from the following site, but it uses pagination. I am using the following snippet:
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import re

browser = webdriver.Chrome()
time.sleep(5)
browser.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(browser, 15)

def extract_data(browser):
    links = browser.find_elements_by_xpath("//div[@class='seeMoreBtn']/a")
    return [link.get_attribute('href') for link in links]

element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//a[@class='glyphicon glyphicon-chevron-right']")))
max_pages = int(re.search(r'\d+ de (\d+)', element.text).group(1), re.UNICODE)

# extract from the current (1) page
print("Page 1")
print(extract_data(browser))

for page in range(2, max_pages + 1):
    print("Page %d" % page)
    next_page = browser.find_element_by_xpath("//a[@class='glyphicon glyphicon-chevron-right']").click()
    print(extract_data(browser))
    print("-----")
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import re

# ----------------------------------------------HANDLING-SELENIUM-STUFF-------------------------------------------------
linkList = []
driver = webdriver.Chrome()
time.sleep(5)
driver.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(driver, 8)
time.sleep(7)

for i in range(1, 2925):
    time.sleep(3)
    # wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "//div[@class='seeMoreBtn']/a")))
    links = driver.find_elements_by_xpath("//div[@class='seeMoreBtn']/a")
    # print(links.text)
    time.sleep(3)
    # appending extracted links to the list
    for link in links:
        value = link.get_attribute("href")
        # linkList.append(value)
        with open('test.csv', 'a', encoding='utf-8', newline='') as fp:
            writer = csv.writer(fp, delimiter=',')
            writer.writerow([value])
        # print(i," ",)
        time.sleep(1)
    driver.find_element_by_xpath("//a[@class='glyphicon glyphicon-chevron-right']").click()
    time.sleep(6)
When I run the script above, I get this error **(I am not very familiar with regular expressions, just exploring the concept)**:
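The likely culprit is the `int(...)` call: its second argument is a numeric base, not a regex flag, so `re.UNICODE` does not belong there. Flags go into `re.search` itself. A minimal sketch, assuming the pager text looks like "3 de 293" (a hypothetical example string, not taken from the site):

```python
import re

# Hypothetical pager label; in the script this would come from element.text
text = "3 de 293"

# Pass flags to re.search, not to int(); int()'s second argument is a base
match = re.search(r'\d+ de (\d+)', text, re.UNICODE)
max_pages = int(match.group(1))  # base defaults to 10
print(max_pages)  # → 293
```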
Try the following code to get the required data without the extra sleeps:
Which links do you want to extract? Do you need the `@href` value from each "MORE INFO" button?

Yes, I need the `@href` values from the MORE INFO buttons. I did manage to extract them with the second script pasted above, but it takes time. Since the pagination has 2925 pages, is there any way to cut the 5-7 seconds per page? After a page loads, it takes 2-3 seconds for the new values to replace the previous ones; if I write to the file during that window, the old values get written again and I end up with duplicates, although it is very fast. Where should I add the roughly 3-second wait until the next page is fully loaded for it to work: after the `EC.staleness_of` statement, or after the click? Thank you, sir, your help is much appreciated.

Hmm... I actually did not check the results for uniqueness. `wait.until(EC.staleness_of(new_links[-1]))` waits for the last link to be refreshed, so I believe the whole list of links gets updated... Do you still need a fix?

Yes, if possible that would be great... but you have already handled it without any delays. A 3-second wait would not be a problem, since I was using 12 seconds per iteration, which sent the complexity through the roof. You pretty much saved my life.
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# ----------------------------------------------HANDLING-SELENIUM-STUFF-------------------------------------------------
driver = webdriver.Chrome()
driver.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(driver, 8)

links = []
while True:
    # wait until the "MORE INFO" links on the current page are visible
    new_links = wait.until(EC.visibility_of_all_elements_located((By.LINK_TEXT, "MORE INFO")))
    links.extend([link.get_attribute("href") for link in new_links])
    try:
        next_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li[title='Next page']>a")))
        next_button.click()
    except TimeoutException:
        break
    # wait for the old links to go stale, so the next page's links are fresh
    wait.until(EC.staleness_of(new_links[-1]))

# Do whatever you need with links
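Once the loop finishes, the collected hrefs can be written to CSV in a single pass instead of reopening the file on every row, as the earlier script did. A sketch with placeholder data (the filename and link values are mine):

```python
import csv

# Placeholder hrefs; in the real script this is the `links` list from the loop above
links = ["https://example.com/facility/1", "https://example.com/facility/2"]

with open('links.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows([link] for link in links)  # one href per row
```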