Python: how do I scrape multiple web pages from one page using Selenium?
Recently I have been trying to pull a large amount of pricing data from a website, starting from one page where every item's page is linked. I want to run a script that clicks an item's box, scrapes that item's price and description, then returns to the starting page and continues the loop. However, there is an obvious problem that I hit after scraping the first item: once I navigate back to the starting page, the container is no longer defined, so a stale element error is raised, which breaks the loop and prevents me from getting the rest of the items. Here is the sample code I am using, with which I hoped to scrape the items one after another:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome(r'C:\Users\Hank\Desktop\chromedriver_win32\chromedriver.exe')
driver.get('https://steamcommunity.com/market/search?q=&category_440_Collection%5B%5D=any&category_440_Type%5B%5D=tag_misc&category_440_Quality%5B%5D=tag_rarity4&appid=440#p1_price_asc')
time.sleep(5)

action = ActionChains(driver)
next_button = wait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'searchResults_btn_next')))

def prices_and_effects():
    imgs = wait(driver, 5).until(EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, 'img.market_listing_item_img.economy_item_hoverable')))
    for img in imgs:
        ActionChains(driver).move_to_element(img).perform()
        print([my_element.text for my_element in wait(driver, 10).until(
            EC.visibility_of_all_elements_located((By.CSS_SELECTOR,
                "div.item_desc_description div.item_desc_descriptors#hover_item_descriptors div.descriptor")))])
    prices = driver.find_elements(By.CSS_SELECTOR, 'span.market_listing_price.market_listing_price_with_fee')
    for price in prices:
        print(price.text)

def unusuals():
    unusuals = wait(driver, 5).until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '.market_listing_row.market_recent_listing_row.market_listing_searchresult')))
    for unusual in unusuals:
        unusual.click()
        time.sleep(2)
        next_button = wait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'searchResults_btn_next')))
        next_button.click()
        time.sleep(2)
        back_button = wait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'searchResults_btn_prev')))
        back_button.click()
        time.sleep(2)
        prices_and_effects()
        ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text
        while next_button.get_attribute('class') == 'pagebtn':
            next_button.click()
            wait(driver, 10).until(lambda driver: wait(driver, 10).until(
                EC.presence_of_element_located((By.ID, 'searchResults_start'))).text != ref_val)
            prices_and_effects()
            ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text
            time.sleep(2)
        driver.execute_script("window.history.go(-1)")
        time.sleep(2)
        unusuals = wait(driver, 5).until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, '.market_listing_row.market_recent_listing_row.market_listing_searchresult')))

unusuals()
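The stale element error happens because the WebElement references collected before navigating away point at a DOM that no longer exists once the page reloads. A common workaround, sketched below rather than taken from the original post, is to count the rows once and then re-find the list by index on every iteration. `FakeDriver` is a stand-in so the pattern runs without a browser; with real Selenium you would call `driver.find_elements(...)` in the same spot after each `window.history.go(-1)`.

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome: returns fresh 'elements' on each call."""
    def __init__(self, rows):
        self._rows = rows

    def find_elements(self, selector):
        # A real driver re-queries the live DOM here, so the references
        # it returns are never stale.
        return list(self._rows)

def scrape_all(driver, selector, visit):
    # Count once, then re-find by index on every iteration: after each
    # page reload the old references are discarded, so nothing goes stale.
    total = len(driver.find_elements(selector))
    results = []
    for i in range(total):
        rows = driver.find_elements(selector)   # fresh lookup each time
        results.append(visit(rows[i]))          # click / scrape / navigate back
    return results

driver = FakeDriver(["item-a", "item-b", "item-c"])
print(scrape_all(driver, ".market_listing_row", str.upper))
# ['ITEM-A', 'ITEM-B', 'ITEM-C']
```

The key point is that nothing found before a navigation is used after it; only the index survives the page reload.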
However, after successfully scraping the first item, it returns to the page and throws a stale element error. The error makes sense to me, but is there a way around it so that I can keep the functions and use the loop?

[Answer] Selenium is overkill for this. You can imitate the HTTP GET requests that the browser itself makes to the same API when it renders the page. Just be careful not to make more than 100,000 requests to the Steam API per day. Also, if the requests come too frequently, the Steam servers will notice and stop responding until some timeout expires, even if you have not yet hit the 100,000-per-day limit; that is why I have added some time.sleep()s after each request, for good measure.
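The per-request pause can also be enforced systematically instead of sprinkling sleeps by hand. This is a minimal sketch, not anything Steam documents; `Throttle.wait()` would be called before each `requests.get(...)`:

```python
import time

class Throttle:
    """Ensure at least min_interval seconds between consecutive requests."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self._last + self.min_interval - now
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Demo with a short interval: three calls enforce two gaps of 0.1 s each.
throttle = Throttle(0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()          # place before each HTTP request
elapsed = time.monotonic() - start
print(elapsed)               # at least 0.2 seconds
```

Unlike an unconditional sleep, this only waits for whatever time is still missing from the interval, so time spent parsing responses counts toward the pause.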
First, you make a request to the market listings page, which shows all the items. Then, for each item in the resulting list, we extract the item's name, make a request to that item's overview page, and use a regular expression to extract the item's item_id from the HTML. Then we make a request to https://steamcommunity.com/market/itemordershistogram to get that item's latest price information.
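For reference, the query parameters below are simply URL-encoded onto the endpoint. This sketch uses the standard library's `urllib.parse.urlencode` so it runs without third-party dependencies; `requests` encodes a `params` dict the same way (the parameter subset shown is illustrative):

```python
from urllib.parse import urlencode

# A subset of the search parameters used below, encoded by hand to show
# what the final request URL looks like.
params = {
    "query": "",
    "start": "0",    # offset of the first result
    "count": "10",   # number of results per request
    "appid": "440",
}
url = "https://steamcommunity.com/market/search/render/?" + urlencode(params)
print(url)
# https://steamcommunity.com/market/search/render/?query=&start=0&count=10&appid=440
```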
您可以在param
字典中随意使用start
和count
查询字符串参数。现在它只打印前十项的信息:
def main():
    import requests
    from bs4 import BeautifulSoup
    import re
    import time

    url = "https://steamcommunity.com/market/search/render/"
    params = {
        "query": "",
        "start": "0",
        "count": "10",
        "search_descriptions": "0",
        "sort_column": "price",
        "sort_dir": "asc",
        "appid": "440",
        "category_440_Collection[]": "any",
        "category_440_Type[]": "tag_misc",
        "category_440_Quality[]": "tag_rarity4"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    time.sleep(1)

    item_id_pattern = r"Market_LoadOrderSpread\( (?P<item_id>\d+) \)"

    soup = BeautifulSoup(response.json()["results_html"], "html.parser")
    for result in soup.select("a.market_listing_row_link"):
        url = result["href"]
        product_name = result.select_one("div")["data-hash-name"]

        try:
            response = requests.get(url)
            response.raise_for_status()
            time.sleep(1)

            item_id_match = re.search(item_id_pattern, response.text)
            assert item_id_match is not None
        except:
            print(f"Skipping {product_name}")
            continue

        url = "https://steamcommunity.com/market/itemordershistogram"
        params = {
            "country": "DE",
            "language": "english",
            "currency": "1",
            "item_nameid": item_id_match.group("item_id"),
            "two_factor": "0"
        }

        response = requests.get(url, params=params)
        response.raise_for_status()
        time.sleep(1)

        data = response.json()
        highest_buy_order = float(data["highest_buy_order"]) / 100.0
        print(f"The current highest buy order for \"{product_name}\" is ${highest_buy_order}")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
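The regular-expression step can be checked in isolation. The HTML snippet below is a made-up illustration of an item overview page; the `Market_LoadOrderSpread(...)` script call is the piece the pattern targets, and the numeric argument is the item_id:

```python
import re

item_id_pattern = r"Market_LoadOrderSpread\( (?P<item_id>\d+) \)"

# Illustrative fragment of an item overview page (the id is invented).
sample_html = """
<script type="text/javascript">
    $J(document).ready(function() {
        Market_LoadOrderSpread( 336458059 );
    });
</script>
"""

match = re.search(item_id_pattern, sample_html)
assert match is not None
print(match.group("item_id"))  # 336458059
```

Note that the pattern requires exactly one space on each side of the number, matching how the page formats the call; if Steam ever changed that whitespace, a more tolerant pattern like `Market_LoadOrderSpread\(\s*(?P<item_id>\d+)\s*\)` would be safer.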
[Comments] Maybe Selenium is overkill and Scrapy would be fine. You would define a "home page" scraper and an "item" scraper, then start by scraping the home page, which produces the list of items for the item scraper to work through. But can Scrapy fetch content from dynamic JS elements? The pagination does seem to require JS. I believe my original comment about the home scraper and the item scraper will help you.
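The two-stage design the comments describe can be sketched without any framework. The fetch functions below are stand-ins for real HTTP calls (in Scrapy they would be two request callbacks); everything about them is invented for illustration:

```python
def fetch_listing():
    # Stage 1 stand-in: the "home page" scraper returns links to item pages.
    return ["/item/a", "/item/b"]

def fetch_item(url):
    # Stage 2 stand-in: the "item" scraper returns (name, price) for one page.
    return (url.rsplit("/", 1)[-1], 12.0)

def crawl():
    # The home scrape produces the worklist; the item scraper consumes it.
    return [fetch_item(u) for u in fetch_listing()]

print(crawl())  # [('a', 12.0), ('b', 12.0)]
```

The answer's requests-based script has exactly this shape: the search/render call is the listing stage, and the per-item overview plus histogram calls are the item stage. Sample output of the answer's script: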
The current highest buy order for "Unusual Cadaver's Cranium" is $12.16
The current highest buy order for "Unusual Backbreaker's Skullcracker" is $13.85
The current highest buy order for "Unusual Hard Counter" is $13.04
The current highest buy order for "Unusual Spiky Viking" is $14.26
The current highest buy order for "Unusual Carouser's Capotain" is $12.72
The current highest buy order for "Unusual Cyborg Stunt Helmet" is $12.89
The current highest buy order for "Unusual Stately Steel Toe" is $12.67
The current highest buy order for "Unusual Bloke's Bucket Hat" is $12.71
The current highest buy order for "Unusual Pugilist's Protector" is $12.94
The current highest buy order for "Unusual Shooter's Sola Topi" is $13.25