Python driver.find_element_by_css_selector running slowly near the end of the page
I have a temperature web scraper working in Python with `from selenium import webdriver`. Near the start of the page, the scraper finds the correct high and low temperatures almost instantly. However, it gets slower and slower toward the end of the page (close to 7 seconds per lookup near the end). This may be because the scraper has to scan through more HTML to find the right data (?). Here is the main part of the code:
high = driver.find_element_by_css_selector('#twc-scrollabe > table > tbody > tr:nth-child(' + str(j) + ') > td.temp > div > span:nth-child(1)').text
low = driver.find_element_by_css_selector('#twc-scrollabe > table > tbody > tr:nth-child(' + str(j) + ') > td.temp > div > span:nth-child(3)').text
date = driver.find_element_by_css_selector('#twc-scrollabe > table > tbody > tr:nth-child(' + str(j) + ') > td:nth-child(2) > div > span').text
#twc-scrollabe > table > tbody > tr:nth-child(1) > td:nth-child(2) > div > span
#twc-scrollabe > table > tbody > tr:nth-child(2) > td:nth-child(2) > div > span
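The three lookups above differ only in the trailing path, so one small refactor (a sketch; the selector strings are copied from the question, `j` is the 1-based row index) is to build them from a shared row prefix, and, more importantly, to fetch the row element once and search within it, so the driver does not re-evaluate a full-document `nth-child` selector for every field:

```python
# Build the three per-row selectors from one shared prefix.
# Selector strings copied from the question; j is the 1-based row index.
def row_selectors(j):
    base = f'#twc-scrollabe > table > tbody > tr:nth-child({j})'
    return {
        'high': f'{base} > td.temp > div > span:nth-child(1)',
        'low':  f'{base} > td.temp > div > span:nth-child(3)',
        'date': f'{base} > td:nth-child(2) > div > span',
    }

# Usage sketch (Selenium): locate the row once, then search inside it,
# so nth-child is evaluated one time per row instead of three:
#   row = driver.find_element_by_css_selector(
#       f'#twc-scrollabe > table > tbody > tr:nth-child({j})')
#   high = row.find_element_by_css_selector(
#       'td.temp > div > span:nth-child(1)').text
```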
Is there a simple (or complex) way around this? Even telling me you think there is no simple solution would be helpful.

Could it be that the content you want is generated by JavaScript? If it is plain HTML, you can avoid the headless browser entirely and use `requests` and `bs4`:
$ python test.py
Got response: 200
Today JUN 1 80°/61°
Sun JUN 2 70°/47°
Mon JUN 3 63°/45°
Tue JUN 4 74°/57°
Wed JUN 5 75°/64°
Thu JUN 6 77°/63°
Fri JUN 7 77°/64°
Sat JUN 8 81°/66°
Sun JUN 9 81°/65°
Mon JUN 10 80°/63°
Tue JUN 11 80°/63°
Wed JUN 12 81°/62°
Thu JUN 13 80°/63°
Fri JUN 14 81°/63°
Sat JUN 15 81°/63°
Total: 0.66s, request: 0.60s
test.py
import requests
import time
from bs4 import BeautifulSoup

URL = 'https://weather.com/weather/tenday/l/USPA1290:1:US'

def fetch(url):
    with requests.Session() as s:
        r = s.get(url, timeout=5)
    return r

def main():
    start_t = time.time()
    resp = fetch(URL)
    print(f'Got response: {resp.status_code}')
    html = resp.text
    bs = BeautifulSoup(html, 'html.parser')
    tds = bs.find_all('td', class_='twc-sticky-col', attrs={'headers': 'day'})
    for td in tds:
        date_time = td.find_next('span', class_='date-time')
        day_detail = td.find_next('span', class_='day-detail')
        temp = td.find_next('td', class_='temp', attrs={'headers': 'hi-lo'})
        hi_lo = '/'.join(i.text for i in temp.find_all('span', class_=''))
        print(f'{date_time.text:5} {day_detail.text:6} {hi_lo}')
    end_t = time.time()
    elapsed_t = end_t - start_t
    r_time = resp.elapsed.total_seconds()
    print(f'Total: {elapsed_t:.2f}s, request: {r_time:.2f}s')

if __name__ == '__main__':
    main()
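The `find_next` calls above walk forward through the document from each day cell. A minimal offline sketch (toy markup; the `hi`/`lo` class names are hypothetical and differ from the real page) shows the same chaining without any network access:

```python
from bs4 import BeautifulSoup

# Toy markup loosely mirroring the day/temp structure targeted above;
# the "hi" and "lo" class names are made up for this demo.
HTML = '''
<table>
  <tr>
    <td class="twc-sticky-col" headers="day">
      <span class="date-time">Today</span> <span class="day-detail">JUN 1</span>
    </td>
    <td class="temp" headers="hi-lo">
      <span class="hi">80&deg;</span>/<span class="lo">61&deg;</span>
    </td>
  </tr>
</table>
'''

def parse_rows(html):
    bs = BeautifulSoup(html, 'html.parser')
    rows = []
    for td in bs.find_all('td', class_='twc-sticky-col', attrs={'headers': 'day'}):
        # find_next searches forward in document order from this cell,
        # so it picks up the spans inside and after it.
        date_time = td.find_next('span', class_='date-time').text
        day_detail = td.find_next('span', class_='day-detail').text
        temp = td.find_next('td', class_='temp', attrs={'headers': 'hi-lo'})
        hi = temp.find('span', class_='hi').text
        lo = temp.find('span', class_='lo').text
        rows.append(f'{date_time} {day_detail} {hi}/{lo}')
    return rows

print(parse_rows(HTML))
```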
How big is this page? Seven seconds seems awfully slow. Is this code in a loop? — Yes, it loops over the 30 largest US cities: `for i in range(0, len(city_url)):` then `print(pd.Timestamp.today())`, `driver.get(city_url[i])` goes to each city's URL, then `print(city_url[i])` and `print(pd.Timestamp.today())`.
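Since the follow-up mentions looping over 30 city URLs, a tiny timing wrapper (a sketch; `city_url` and `driver` are the questioner's objects, not defined here) makes it easy to see which cities, or which lookups within a page, are the slow ones:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage sketch with the questioner's loop (city_url and driver
# are assumed to exist in the caller's scope):
#   for url in city_url:
#       _, secs = timed(driver.get, url)
#       print(f'{url}: {secs:.2f}s')
```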