Unable to scrape AJAX-loaded elements on a web page
I need to scrape a web page (link in the code below). The page has a cross-reference section that I want to scrape, but when I collect the page content with Python requests using the following code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
the resulting content has no cross-reference section, probably because it never gets loaded. I can scrape the rest of the HTML content, just not the cross-reference section. When I do the same thing with Selenium, it works fine, which means Selenium is able to find this element after the page has loaded.
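The gap can be reproduced offline: BeautifulSoup only sees the markup that arrives in the HTTP response, so anything a browser-side script injects afterwards is simply absent. A minimal sketch with hypothetical markup (the element ids and the encoded blob are illustrative, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical server response: the cross-reference container ships empty,
# and the data only exists as an encoded state blob that a browser-side
# script would render into the page after load.
raw_html = '''
<div id="cross-ref"></div>
<script id="arrow-state" type="application/json">{&q;parts&q;:3}</script>
'''

soup = BeautifulSoup(raw_html, 'html.parser')

# requests/BeautifulSoup never execute the script, so the container is empty
print(repr(soup.select_one('#cross-ref').get_text()))   # ''
# ...but the raw state blob is still reachable in the parsed markup
print(soup.select_one('#arrow-state').get_text())
```

This is why Selenium (which runs the scripts in a real browser) sees the section while a plain GET request does not.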
Can someone tell me how to do this with Python requests and beautifulsoup instead of Selenium?

The data is loaded through JavaScript, but you can extract it with the requests, beautifulsoup, and json modules:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# The page state is embedded in the #arrow-state tag as entity-encoded JSON
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads(t)

# Find the PdpWrapper component, which holds the product-details placeholder
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reverence_product_tiles, indent=4))
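A deep key path like data['jss']['sitecore']['route']... raises KeyError as soon as the site changes its state shape. A hypothetical helper (not part of the original answer) makes the traversal fail soft instead:

```python
# Hypothetical helper: follow a sequence of dict keys / list indices,
# returning `default` instead of raising if any step along the path
# is missing or has an unexpected type.
def dig(obj, *keys, default=None):
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Minimal fake of the embedded state, just to exercise the helper
state = {'jss': {'sitecore': {'route': {'placeholders': {'arrow-main': []}}}}}
print(dig(state, 'jss', 'sitecore', 'route', 'placeholders', 'arrow-main'))  # []
print(dig(state, 'jss', 'no-such-key', 'route', default='missing'))          # missing
```

With a helper like this, a site redesign yields a None (or a chosen default) you can log, rather than a traceback halfway through a scrape.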
This alone should be enough to scrape the cross-reference section.

To see where all the elements are located once they have loaded, you can use either of the following:
- Using CSS_SELECTOR:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
- Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
- Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
- Console output:
['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
When you print(soup.select_one("#arrow-state").text), you will see that the text is encoded - the entity sequences (&q;, &g;, etc.) need to be replaced with their respective characters before the json module can parse it.

As I mentioned in the question, I have already done this with Selenium; I want to do it with requests and beautifulsoup. Thanks anyway.