How to scrape all the dates using Python
I need to scrape soccerway.com. When I select the dates for each season of a competition (e.g. 2011–2013), only the last one, 2012-2013, gets saved instead of both 2011-2012 and 2012-2013. This is my code:
from time import sleep
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


def get_urls_season(url_path):
    driver = webdriver.Chrome()
    driver.fullscreen_window()
    driver.get("https://us.soccerway.com" + url_path)
    click_privacy_policy(driver)
    date = date_selector(driver)
    # url_list = cycle_through_game_weeks(driver)
    url_list.reverse()
    driver.quit()
    print("=" * 100)
    print(f"{len(set(url_list))} find")
    if input("con? (y/n): ") != "y":
        exit()
    return url_list


def date_selector(driver):
    inptdate = '2010-2012'
    startdate = inptdate.split('-')[0]
    enddate = inptdate.split('-')[1]
    while int(startdate) < int(enddate):
        textstring = str(startdate) + "/" + str(int(startdate) + 1)
        print(textstring)
        driver.find_element_by_xpath(
            "//select[@name='season_id']/option[text()='" + textstring + "']"
        ).click()
        startdate = int(startdate) + 1
        url_list = cycle_through_game_weeks(driver)


def click_privacy_policy(driver):
    try:
        driver.find_element_by_class_name("qc-cmp-button").click()
    except NoSuchElementException:
        pass


def cycle_through_game_weeks(driver):
    season_urls = get_fixture_urls(innerhtml_soup(driver))
    while is_previous_button_enabled(driver):
        click_previous_button(driver)
        sleep(2)
        urls = get_fixture_urls(innerhtml_soup(driver))
        urls.reverse()
        season_urls += urls
    return season_urls


def is_previous_button_enabled(driver):
    return driver.find_element_by_id(
        "page_competition_1_block_competition_matches_summary_5_previous"
    ).get_attribute("class") != "previous disabled"


def click_previous_button(driver):
    driver.find_element_by_id(
        "page_competition_1_block_competition_matches_summary_5_previous"
    ).click()


def get_fixture_urls(soup):
    urls = []
    for elem in soup.select(".info-button.button > a"):
        urls.append(urlparse(elem.get("href")).path)
    return urls


def innerhtml_soup(driver):
    html = driver.find_element_by_tag_name("html").get_attribute("innerHTML")
    soup = BeautifulSoup(html, "html.parser")
    return soup
I need to scrape all the dates in 2011–2013, that is 2011-2012 and 2012-2013, not only the last one.

I can't find the cause of the problem.

If I understand the code correctly, the problem is here:
url_list = cycle_through_game_weeks(driver)
On each iteration you overwrite the old url_list with a new one. The simplest fix is:
url_list += cycle_through_game_weeks(driver)
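The difference can be shown without Selenium at all. This is a minimal sketch with a made-up `fake_cycle` helper standing in for `cycle_through_game_weeks`: plain assignment with `=` discards everything collected so far, while `+=` accumulates across iterations.

```python
def fake_cycle(season):
    # Stand-in for cycle_through_game_weeks: returns that season's URLs.
    return [f"/matches/{season}/week1", f"/matches/{season}/week2"]

overwritten = []
accumulated = []
for season in ("2011-2012", "2012-2013"):
    overwritten = fake_cycle(season)    # "=" keeps only the last season
    accumulated += fake_cycle(season)   # "+=" keeps every season

print(len(overwritten))  # 2 -> only 2012-2013 survived
print(len(accumulated))  # 4 -> both seasons collected
```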
More elegant and efficient:
url_list = []
while int(startdate) < int(enddate):
    textstring = str(startdate) + "/" + str(int(startdate) + 1)
    print(textstring)
    driver.find_element_by_xpath(
        "//select[@name='season_id']/option[text()='" + textstring + "']"
    ).click()
    startdate = int(startdate) + 1
    url_list.append(cycle_through_game_weeks(driver))
return url_list
This way url_list[0] holds the first year's URLs, url_list[1] the second year's, and so on.

I ran the program with the second solution, but I get an error. How do I solve it?

print(f"{len(set(url_list))} find")
TypeError: unhashable type: 'list'

(The dots stand for the parts that were left unchanged.)