
Unable to scrape AJAX-loaded elements from a web page

Tags: ajax, selenium, web-scraping, beautifulsoup, python-requests


I need to scrape a web page (the URL appears in the code below). It has a cross-reference section that I want to extract, but when I fetch the page content with python-requests using the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
the resulting content does not include the cross-reference section, probably because it has not been loaded. I can scrape the rest of the HTML content, just not the cross-reference section. When I do the same thing with selenium it works fine, which means selenium is able to find this element after the page loads.

Can anyone tell me how to do this using python-requests and beautifulsoup instead of selenium?

The data is loaded via JavaScript, but you can extract it using the requests, beautifulsoup and json modules:

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'

headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# The page state is embedded in a <script id="arrow-state"> tag using
# custom escapes; restore them before parsing the text as JSON.
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads(t)

# Find the PdpWrapper component in the page state.
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    # Note: 'crossReverence' is the (misspelled) key used by the site's own JSON.
    cross_reference_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reference_product_tiles, indent=4))
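The key path into the state blob is deep and may change when the site is redeployed, so it can help to traverse it defensively. A minimal sketch (the dig helper and the toy structure are my own, not part of any library; the toy only mimics the shape of Arrow's blob):

```python
# Hedged sketch: walk a nested dict/list structure, returning None instead of
# raising an exception when any step of the key path is missing.
def dig(obj, *keys):
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    return obj

# Toy structure shaped like the real page state:
toy = {'placeholders': {'product-details': [{'fields': {'tiles': ['A', 'B']}}]}}
print(dig(toy, 'placeholders', 'product-details', 0, 'fields', 'tiles'))  # -> ['A', 'B']
print(dig(toy, 'placeholders', 'missing', 'path'))  # -> None
```

This way a changed key yields None to check for, rather than a KeyError deep inside a one-line expression.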
This alone should be enough to scrape the cross-reference section.

To see where all the elements are located, you can use either of the following:

  • Using CSS_SELECTOR:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
    
  • Using XPATH:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
    
  • Note: you have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
    
  • Console output:

      ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
    

Thanks man, I got it, but could you explain the replacement part? Thanks for the reply.

@A.Hamza When you do print(soup.select_one('#arrow-state').text) you will see that the text is encoded - before the json module can parse it, the entities (&q;, &g;, etc.) need to be replaced with their respective characters.

As I mentioned in the question, I have already done this with selenium; I wanted to do it using requests and beautifulsoup. Thanks anyway.
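The replacement step asked about above can be seen in isolation. A minimal sketch with a made-up sample string (the entity names &q;, &g;, &l;, &a; come from the page itself; the part number is just an example value):

```python
import json

# The #arrow-state blob stores JSON with custom escapes: &q; for '"',
# &g; for '>', &l; for '<' and &a; for '&'. Undo them, then parse.
encoded = '{&q;part&q;:&q;LMK107BBJ475MKLT&q;}'
for entity, char in (('&q;', '"'), ('&g;', '>'), ('&l;', '<'), ('&a;', '&')):
    encoded = encoded.replace(entity, char)
data = json.loads(encoded)
print(data['part'])  # -> LMK107BBJ475MKLT
```

Replacing &a; last matches the order in the answer's code, so an '&' produced by that step cannot be re-interpreted as the start of another escape.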