Web scraping after a search using Python, Selenium, and BeautifulSoup

After entering all the necessary information, I want to scrape a high school summary table from the web. However, I don't know how to do it, because the URL doesn't change after navigating to the school's page. I haven't found anything related to what I'm trying to do. Do you know how I can scrape the table after the search? Thanks a lot.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text(input("New Jersey"))

driver.find_element_by_id("city").send_keys(input("Galloway"))
driver.find_element_by_id("name").send_keys(input("Absegami High School"))
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

url = driver.current_url
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
school_info = soup.find('table', class_="border=")
print(school_info)
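
For what it's worth, requests.get(driver.current_url) fetches a fresh copy of the page that does not carry the search Selenium just performed, which is why this approach comes back empty. A minimal sketch that keeps BeautifulSoup would parse driver.page_source instead (the class name below is borrowed from the tables discussed in the answer that follows):

from bs4 import BeautifulSoup

# requests.get(driver.current_url) re-fetches the page without the search
# state, so parse the DOM the Selenium session is already holding instead
soup = BeautifulSoup(driver.page_source, "html.parser")

# assumption: the data cells carry the class tdTinyFontForWsrDetail,
# as shown in the answer below
cells = soup.find_all(class_="tdTinyFontForWsrDetail")
print([c.get_text(strip=True) for c in cells[:5]])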

Try this:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")

driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

#scraping the captions of the tables
all_sub_head = driver.find_elements_by_class_name("tableSubHeaderForWsrDetail")

#scraping all the headers of the tables
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")

#filtering the desired headers
required_headers = all_headers[5:]

#scraping all the table data
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")

#filtering the desired table data
required_contents = all_contents[45:]
    
print("                ",all_sub_head[1].text,"                ")
for i in range(15):
    print(required_headers[i].text, "              >     ", required_contents[i].text )
    
print("execution completed")
Output

                 High School Summary                 
NCAA High School Code               >      310759
CEEB Code               >      310759
High School Name               >      ABSEGAMI HIGH SCHOOL
Address               >      201 S WRANGLEBORO RD
GALLOWAY
NJ - 08205
Primary Contact Name               >      BONNIE WADE
Primary Contact Phone               >      609-652-1485
Primary Contact Fax               >      609-404-9683
Primary Contact Email               >      bwade@gehrhsd.net
Secondary Contact Name               >      MR. DANIEL KERN
Secondary Contact Phone               >      6096521372
Secondary Contact Fax               >      6094049683
Secondary Contact Email               >      dkern@gehrhsd.net
School Website               >      http://www.gehrhsd.net/
Link to Online Course Catalog/Program of Studies               >      Not Available
Last Update of List of NCAA Courses               >      12-Feb-20
execution completed
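
A side note on the API used above: the find_element_by_* helpers were removed in Selenium 4, so on a current install the same lookups would be written with By locators, roughly like this:

from selenium.webdriver.common.by import By

# equivalent lookups on Selenium 4+ (same locating strategies as above)
state_drop = driver.find_element(By.ID, "state")
all_headers = driver.find_elements(By.CLASS_NAME, "tableHeaderForWsrDetail")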

Which table do you want to scrape? There are multiple tables on the page.

As I mentioned in the post, the High School Summary table.

Use driver = webdriver.Chrome("drivers/chromedriver") instead of driver = webdriver.Chrome().

Can you explain required_contents = all_contents[45:] a bit more?

As you can see, there are three tables: High School Account Status, High School Summary, and High School Information. What they have in common is that the captions (the blue text) are all stored under one common class, tableSubHeaderForWsrDetail; all the yellow-background header text is stored under another common class, tableHeaderForWsrDetail; and all the table data likewise shares the common class tdTinyFontForWsrDetail. So required_contents = all_contents[45:] simply slices the table data: the High School Account Status table takes up the first 5 x 9 = 45 data cells, and the remaining cells, which belong to High School Summary, end up in the required_contents list.
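
One caveat about that hard-coded offset: all_contents[45:] silently breaks if the first table ever gains or loses a cell. A sturdier sketch (assuming the caption element sits inside the table it labels, which I have not verified against the live page) would scope the lookup to the High School Summary table itself:

# hypothetical variant: locate the caption reading "High School Summary",
# walk up to its enclosing table, then search only inside that table
summary_table = driver.find_element_by_xpath(
    "//*[contains(@class, 'tableSubHeaderForWsrDetail')]"
    "[contains(normalize-space(.), 'High School Summary')]/ancestor::table[1]"
)
headers = summary_table.find_elements_by_class_name("tableHeaderForWsrDetail")
contents = summary_table.find_elements_by_class_name("tdTinyFontForWsrDetail")
for header, content in zip(headers, contents):
    print(header.text, ">", content.text)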