Web scraping after a search with Python, Selenium, and BeautifulSoup
After entering all the required information, I want to scrape a high school summary table from the site. However, I don't know how to do it, because the URL does not change after navigating to the school's page. I haven't found anything related to what I'm trying to do. Do you know how I can scrape the table after the search? Thanks in advance.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome("drivers/chromedriver")
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text(input("New Jersey"))
driver.find_element_by_id("city").send_keys(input("Galloway"))
driver.find_element_by_id("name").send_keys(input("Absegami High School"))
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()
url = driver.current_url
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
school_info = soup.find('table', class_="border=")
print(school_info)
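Worth noting as an aside (not part of the original question): because the URL never changes, requests.get(driver.current_url) fetches the empty search form again, so the table is not in r.text. A common workaround is to parse the HTML that Selenium has already rendered via driver.page_source; a minimal sketch, reusing the driver and the BeautifulSoup import from the code above:
# Sketch only: parse the page Selenium is currently showing instead of re-requesting the URL.
soup = BeautifulSoup(driver.page_source, "html.parser")
# Print a short preview of every table on the results page to locate the right one.
for table in soup.find_all("table"):
    print(table.get_text(" ", strip=True)[:80])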
Try this:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")
driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()
#scraping the caption of the tables
all_sub_head = driver.find_elements_by_class_name("tableSubHeaderForWsrDetail")
#scraping all the headers of the tables
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")
#filtering the desired headers
required_headers = all_headers[5:]
#scraping all the table data
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")
#filtering the desired table data
required_contents = all_contents[45:]
print(" ",all_sub_head[1].text," ")
for i in range(15):
    print(required_headers[i].text, " > ", required_contents[i].text)
print("execution completed")
Output:
High School Summary
NCAA High School Code > 310759
CEEB Code > 310759
High School Name > ABSEGAMI HIGH SCHOOL
Address > 201 S WRANGLEBORO RD
GALLOWAY
NJ - 08205
Primary Contact Name > BONNIE WADE
Primary Contact Phone > 609-652-1485
Primary Contact Fax > 609-404-9683
Primary Contact Email > bwade@gehrhsd.net
Secondary Contact Name > MR. DANIEL KERN
Secondary Contact Phone > 6096521372
Secondary Contact Fax > 6094049683
Secondary Contact Email > dkern@gehrhsd.net
School Website > http://www.gehrhsd.net/
Link to Online Course Catalog/Program of Studies > Not Available
Last Update of List of NCAA Courses > 12-Feb-20
execution completed
[Screenshot of the output]

Which table do you want to scrape? There are several tables on that page.

As I mentioned in the post, the High School Summary table.

Use driver = webdriver.Chrome("drivers/chromedriver") instead of driver = webdriver.Chrome().

Can you explain required_contents = all_contents[45:] a bit more?

As you can see, there are three tables: High School Account Status, High School Summary, and High School Information. What they have in common is that the captions (the blue text) are stored under one common class, tableSubHeaderForWsrDetail, all of the yellow-background text is stored under another common class, tableHeaderForWsrDetail, and all of the table data is stored under the common class tdTinyFontForWsrDetail. So required_contents = all_contents[45:] just slices the table data: 5 x 9 = 45 data cells belong to the High School Account Status table, and the remaining data cells, which belong to the High School Summary, end up in the required_contents list.
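For reference, a minimal sketch of the slicing idea described in the comment above, under the same assumptions as the answer (the driver is already on the results page and the class names are as given); it pairs the sliced headers with the sliced data cells so the 45-cell offset is easier to see:
# Sketch only: build a dict for the High School Summary table.
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")
# Skip the 5 header cells and 5 x 9 = 45 data cells of the High School Account Status table,
# then keep the 15 header/value pairs of the High School Summary (see the output above).
summary = {h.text: v.text for h, v in zip(all_headers[5:20], all_contents[45:60])}
print(summary.get("High School Name"))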