Scraping a JavaScript-rendered website that requires a button click
I am trying to scrape a website that has multiple JavaScript-rendered pages. I can get the content from the first page, but I am not sure how to make the script click the button to get the content of the subsequent pages. Here is my script:
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import json
# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'
# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
# Chrome expects the size as "width,height"
chrome_options.add_argument('--window-size=1920,1080')
# Fire up the headless browser (Selenium 4 style; older releases took
# executable_path= and chrome_options= directly)
browser = webdriver.Chrome(service=Service(executable_path=webdriver_path),
                           options=chrome_options)
# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)
# to ensure that the page has loaded completely.
time.sleep(3)
data = []
# Parse HTML of the first results page
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.find_all("div", {"class": "result-item tooltip"})
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.find('h4', {'class': 'textbook-title'}).text.strip()
    item['author'] = container.find('p', {'class': 'textbook-authors'}).text.strip()
    item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class': 'textbook-title'}).a["href"]
    item['source'] = "eCampus Ontario"
    item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
    data.append(item)  # add the item to the list
# Write the collected items to disk, then close the browser
with open("js-webscrape-2.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
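I wrote a small script that should help. It loops for as long as the last page link in the catalogue's page counter is not selected (that is, its class does not contain "selected"): on each pass it scrapes the current page and then clicks Next.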
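from selenium.webdriver.common.by import By
# `driver` is the Selenium WebDriver (called `browser` in the question's script)
while "selected" not in driver.find_elements(By.CSS_SELECTOR, "[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
    # your scraping here
    driver.find_element(By.CSS_SELECTOR, "[id='next-btn']").click()
# the loop stops on the last page without scraping it, so scrape once more here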
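One problem you may run into with this approach is that it does not wait for the results to load, but it should give you a starting point for working out what to do.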
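The waiting problem mentioned above could be handled roughly as follows. This is only a sketch under stated assumptions: `wait_until` and `scrape_all_pages` are hypothetical helper names, `scrape_page` is whatever function you use to parse one page of HTML, and the "page has loaded" test (the page source differs from what it was before the click) is a guess about how this site behaves. Selenium's own `WebDriverWait` does the same kind of polling and is the more usual choice; the plain-Python helper is used here only to keep the example free of extra imports. Unlike the loop above, this version also scrapes the last page.

```python
import time

# Locator strategy string; this is the value of selenium's By.CSS_SELECTOR.
CSS = "css selector"


def wait_until(condition, timeout=10.0, poll=0.25):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


def scrape_all_pages(driver, scrape_page, timeout=10.0):
    """Scrape every results page, clicking Next until the last page link is selected.

    scrape_page (hypothetical) takes the page HTML and returns a list of items.
    """
    results = []
    while True:
        results.extend(scrape_page(driver.page_source))
        page_links = driver.find_elements(CSS, "[id='results-pagecounter-pages'] a")
        # Stop once the last page link carries the "selected" class.
        if not page_links or "selected" in (page_links[-1].get_attribute("class") or ""):
            break
        before = driver.page_source
        driver.find_element(CSS, "[id='next-btn']").click()
        # Assumption: a loaded next page shows up as a change in the page source.
        wait_until(lambda: driver.page_source != before, timeout)
    return results
```

With the question's script, this might be called as `data = scrape_all_pages(browser, parse_results)`, where `parse_results` would wrap the existing BeautifulSoup loop.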
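Hope it helps.
You don't actually have to click any buttons. For example, to search for items matching the keyword "electricity", navigate to this URL: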
https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
This returns a JSON string of items, the first of which is:
{"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
Now, to get that item, take its uuid and navigate to:
https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
You can do any interaction with the site this way. It does not work for every website, but it does work for yours.
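The two URL patterns above can be put together in a few lines of standard-library Python. Treat this as a sketch rather than a tested client: the function names (`build_search_url`, `item_page_urls`, `fetch_items`) are made up for illustration, the query parameters are copied from the example URL, and the code assumes the response keeps the `{"items": [...]}` shape shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

REST_URL = "https://openlibrary-repo.ecampusontario.ca/rest/filtered-items"
ITEM_URL = "https://openlibrary.ecampusontario.ca/catalogue/item/?id="


def build_search_url(keyword, offset=0, limit=10000):
    """Build a filtered-items URL with the same parameters as the example URL."""
    params = [
        ("query_field[]", "*"),
        ("query_op[]", "matches"),
        ("query_val[]", "(?i)" + keyword),  # "(?i)" prefix copied from the example
        ("filters", "is_not_withdrawn"),
        ("offset", offset),
        ("limit", limit),
    ]
    # safe="*()" leaves those characters unescaped, as in the example URL.
    return REST_URL + "?" + urlencode(params, safe="*()")


def item_page_urls(payload):
    """Turn a decoded response into (name, catalogue page URL) pairs."""
    return [(item["name"], ITEM_URL + item["uuid"]) for item in payload["items"]]


def fetch_items(keyword):
    """Download and decode the search results (performs a network request)."""
    with urlopen(build_search_url(keyword)) as response:
        return json.load(response)
```

For example, `for name, url in item_page_urls(fetch_items("electricity")): print(name, url)` would list each matching title with its catalogue page, assuming the endpoint is still reachable.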
To find out which URL is requested when you click a given button or enter text, you can use