Python: handling web-scraped data that arrives in an inconsistent order with Selenium
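Below are three examples of the data I am trying to scrape. The information is on the left side of the page and includes athletic information and a few other statistics. The data is pulled in as one large element. I tried using index numbers to pick out the individual pieces of information, but the order differs from athlete to athlete, and some fields are not available at all. This results in either an index error or the wrong value (e.g., getting the 40 yard dash under the squat number):

Jersey #: 1
Positions: CB, WR
Height & Weight: 6'1" 189lbs
40 Yard Dash: 4.55
Bench: 190
Squat(LBS): 370
Clean(LBS): 225
Class of: 2021

Jersey #: 6
Positions: MLB, RB
Height & Weight: 6'1" 210lbs
Class of: 2021

Jersey #: ...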
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import time

TIMEOUT = 5

driver = webdriver.Firefox()
driver.set_page_load_timeout(TIMEOUT)

url = 'https://www.hudl.com/profile/7670389/GaQuincy-McKinstry'
try:
    driver.get(url)
except TimeoutException:
    # carry on with whatever had loaded before the timeout
    pass
time.sleep(3)

# Selenium 4 locators: find_element(By..., ...) replaces the older
# find_element_by_* helpers, which current releases no longer provide.
try:
    # click this button if it is present on the page
    isPresent = driver.find_element(By.XPATH, '//*[@id="app"]/div/div[2]/div/div/div[2]/div[3]/div/div[1]/div[1]/div[1]/button')
    isPresent.click()
except Exception:
    pass
time.sleep(3)

skills = driver.find_elements(By.CSS_SELECTOR, '#app > div > div.prof-flex-height > div > div > div.parallax-layer.front > div.profile-tab > div > div.left-column > div.stats > ul')
skills = [one.text for one in skills]
print(skills)

try:
    athletic_skills = driver.find_elements(By.CLASS_NAME, 'stats-list')
    athletic_skills = [skill.text for skill in athletic_skills]
    # the last stats list holds one "Label: value" entry per line
    athletic_skills = athletic_skills[-1].split('\n')

    # positional indexing: this is the part that breaks when a field
    # is missing or the order changes
    jersey = athletic_skills[0].replace('Jersey #: ', '')
    position = athletic_skills[1].replace('Positions: ', '')
    height_weight = athletic_skills[2].replace('Height & Weight: ', '')
    height_weight = height_weight.split()
    height = height_weight[0]
    weight = height_weight[-1]
    yard_dash = athletic_skills[3].replace('40 Yard Dash: ', '')
    bench = athletic_skills[4].replace('Bench: ', '')
    squat = athletic_skills[5].replace('Squat(LBS): ', '')
    clean = athletic_skills[6].replace('Clean(LBS): ', '')
    grad_year = athletic_skills[7].replace('Class of: ', '')

    print(athletic_skills)
    print(jersey)
    print(position)
    print(height_weight)
    print(height)
    print(weight)
    print(yard_dash)
    print(bench)
    print(squat)
    print(clean)
    print(grad_year)
except Exception:
    pass

driver.close()
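Short answer: first load the raw data for each player into a Python dictionary.

Longer answer:

A dictionary lets you map a key (such as 40 Yard Dash) to the related statistic (such as 4.55).

You can use the data you captured in athletic_skills as your starting point.

For example: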
# new empty dictionary:
mckinstry_skills = {}

for skill_stats in athletic_skills:
    # separate the skill name from the related statistic:
    skill_stats = skill_stats.split(': ', 1)
    # add this as a new entry into the dictionary:
    mckinstry_skills[skill_stats[0]] = skill_stats[1]

# print the full dictionary:
print(mckinstry_skills)

# print the results of retrieving one item:
print(mckinstry_skills['40 Yard Dash'])
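The first print statement outputs the full dictionary, with one entry per statistic, keyed by its label (for example 'Jersey #', 'Positions', and '40 Yard Dash').

The second print statement outputs only the following: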
4.55
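Now you can reliably get the correct statistic for the field you want, regardless of the order in which the page lists it.

Because not every player has every statistic, you may want to make sure the key exists before trying to retrieve its value: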
if '40 Yard Dash' in mckinstry_skills:
    print(mckinstry_skills['40 Yard Dash'])
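If you are collecting the same columns for many players, the key check can be folded into dict.get(), which returns a default instead of raising KeyError. The following is only a sketch: the helper names parse_stats and to_row, the FIELDS list, and the None default are my own assumptions, and the sample lines are typed in by hand from the second player in the question rather than scraped.

```python
# Labels as they appear in the question's scraped text.
# Using None for a missing statistic is an assumption; pick any default.
FIELDS = ['Jersey #', 'Positions', 'Height & Weight', '40 Yard Dash',
          'Bench', 'Squat(LBS)', 'Clean(LBS)', 'Class of']


def parse_stats(lines):
    """Turn 'Label: value' lines into a dict, skipping lines without ': '."""
    stats = {}
    for line in lines:
        label, sep, value = line.partition(': ')
        if sep:  # keep only lines that actually contained the separator
            stats[label.strip()] = value.strip()
    return stats


def to_row(stats):
    """Return one value per field, with None where a statistic is missing."""
    return [stats.get(field) for field in FIELDS]


# Hand-typed sample modeled on the second player in the question.
player = parse_stats([
    'Jersey #: 6',
    'Positions: MLB, RB',
    "Height & Weight: 6'1\" 210lbs",
    'Class of: 2021',
])

print(player.get('40 Yard Dash', 'N/A'))  # prints N/A: this player has no 40 time
print(to_row(player))
```

In the scraper, you would pass the athletic_skills list to parse_stats in place of the hand-typed lines. Every row then has the same length and column order, so the rows can go straight into a CSV file or a table.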
If you are not familiar with dicts, there are plenty of overviews available. If you already are, please forgive the over-explanation.