
Python: How to extract links from a site that contains pagination? (using Selenium)


I want to extract links from the following site (the URL appears in the code), but it contains pagination:

I am using the following code snippet:

import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import re


browser = webdriver.Chrome()
time.sleep(5)
browser.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(browser,15)

def extract_data(browser):
    links = browser.find_elements_by_xpath("//div[@class='seeMoreBtn']/a")
    return [link.get_attribute('href') for link in links]


element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//a[@class='glyphicon glyphicon-chevron-right']")))
# NOTE: re.UNICODE is being passed as int()'s *base* argument here, and
# re.search() returns None (so .group() raises AttributeError) whenever
# the pager text does not match the pattern
max_pages = int(re.search(r'\d+ de (\d+)', element.text).group(1), re.UNICODE)
# extract from the current (1) page
print("Page 1")
print(extract_data(browser))

for page in range(2, max_pages + 1):
    print("Page %d" % page)
    # .click() returns None, so there is nothing useful to assign here
    browser.find_element_by_xpath("//a[@class='glyphicon glyphicon-chevron-right']").click()
    print(extract_data(browser))
    print("-----")
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import re



# ----------------------------------------------HANDLING-SELENIUM-STUFF-------------------------------------------------
linkList = []
driver = webdriver.Chrome()
time.sleep(5)
driver.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(driver,8)
time.sleep(7)

for i in range(1,2925):
    time.sleep(3)
    # (note: this commented-out wait passes an XPath expression to By.CSS_SELECTOR)
    # wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "//div[@class='seeMoreBtn']/a")))
    links = driver.find_elements_by_xpath("//div[@class='seeMoreBtn']/a")
    # print(links.text)
    time.sleep(3)

    #appending extracted links to the list
    for link in links:
        value=link.get_attribute("href")
        # linkList.append(value)
        # reopening the file for every single link is slow; see the sketch after this block
        with open('test.csv','a',encoding='utf-8',newline='') as fp:
            writer = csv.writer(fp, delimiter=',')
            writer.writerow([value])
    # print(i,"  ",)
    time.sleep(1)
    driver.find_element_by_xpath("//a[@class='glyphicon glyphicon-chevron-right']").click()
    time.sleep(6)
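As an aside, reopening test.csv for every link accounts for part of the slowness. A minimal sketch of the same inner loop with the file opened once per page (a hypothetical refactor, not the original code; links comes from the surrounding loop):

import csv

# open the file once per page instead of once per link
with open('test.csv', 'a', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp, delimiter=',')
    for link in links:
        writer.writerow([link.get_attribute("href")])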
When I run the first script above, I get this error **(I am not very familiar with regular expressions, just exploring the concept)**:
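For reference, a minimal sketch of parsing the page count defensively, assuming the pager text looks like "1 de 2925" (the flags argument belongs to re.search, not int, and a failed match is worth handling explicitly):

import re

def parse_max_pages(pager_text):
    # pass flags to re.search; int() would treat re.UNICODE as a number base
    match = re.search(r'\d+ de (\d+)', pager_text, re.UNICODE)
    if match is None:
        # fail loudly instead of chaining .group() on None
        raise ValueError("could not parse page count from %r" % pager_text)
    return int(match.group(1))

print(parse_max_pages("1 de 2925"))  # -> 2925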


Try the following code to get the required data without the extra sleeps (the full listing follows the comments below):


Comments on this answer:

"Which links do you want to extract? Do you need the @href value from each MORE INFO button?"

"Yes, I need the @href values from the MORE INFO buttons. Somehow I managed to extract those values with the second code I pasted above, but it takes time. Since the pagination part has 2925 pages, is there any way to shave off 5-7 seconds? After a page loads it takes another 2-3 seconds for the new values to replace the previous ones, and if I write to the file during that window the old values get written again and I end up with duplicates. Your version, though, is very fast. Where should I add a wait of roughly 3 seconds until the next page is fully loaded for it to work: after the EC.staleness statement, or after the click?"

"Hm... I actually did not check the results for uniqueness. wait.until(EC.staleness_of(new_links[-1])) waits for the last link to be refreshed, so I assumed the whole list of links gets updated... Do you still need a fix?"

"Yes, that would be great if you can... but you have already handled it without the sleeps. A 3-second wait would not be a problem, since I was spending about 12 seconds per loop iteration, which drove the run time through the roof. You pretty much saved my life."
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException



# ----------------------------------------------HANDLING-SELENIUM-STUFF-------------------------------------------------
driver = webdriver.Chrome()
driver.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
wait = WebDriverWait(driver, 8)

links = []

while True:
    new_links = wait.until(EC.visibility_of_all_elements_located((By.LINK_TEXT, "MORE INFO")))
    links.extend([link.get_attribute("href") for link in new_links])

    try:
        next_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li[title='Next page']>a")))
        next_button.click()
    except TimeoutException:
        break
    # block until the last link from the previous page goes stale,
    # i.e. the next page's DOM has replaced it
    wait.until(EC.staleness_of(new_links[-1]))

#  Do whatever you need with links
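For instance, a minimal sketch that de-duplicates the collected hrefs (dict.fromkeys preserves insertion order in Python 3.7+) and writes them out in one pass, reusing the test.csv file name from the question:

import csv

unique_links = list(dict.fromkeys(links))  # drop repeated hrefs, keep order

with open('test.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp)
    for href in unique_links:
        writer.writerow([href])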