Scraping a dynamic web page with Selenium in Python fails
I am trying to scrape all 5,000 companies from this page. The page is dynamic: more companies load as I scroll down, but I can only scrape 5 of them. How do I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it didn't work.
Note: I want to scrape all of each company's information, but for now I have selected only two fields.
import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
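One likely reason the block above only ever sees 5 companies: after rendering the page with Selenium and saving it into page, the code re-fetches the URL with urlopen and parses that raw server response, which contains none of the dynamically loaded content. A minimal sketch of parsing the already-rendered source instead (extract_companies is an illustrative helper name; the class names are the ones from the question and may change on the live site):

```python
from bs4 import BeautifulSoup

def extract_companies(html):
    """Parse rendered HTML and return (rank, company name) pairs."""
    page_soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile"):
        # First <h2> holds the rank; the class-tagged <h2> holds the name.
        rank = container.h2.get_text()
        name = container.find_all("h2", class_="sc-AxgMl LXebc h2")[0].get_text()
        results.append((rank, name))
    return results

# Feed it the Selenium-rendered source, not the urlopen response:
# for rank, name in extract_companies(page):
#     print("rank :" + rank)
#     print("Company_name :" + name)
```

This only removes the redundant urlopen round trip; it does not by itself load more companies — the page still has to be scrolled before page_source is captured.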
I updated the code, but the page does not scroll at all. I also corrected some mistakes in the BeautifulSoup code.
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)

def scroll_down(self):
    """A method for scrolling the page."""
    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load the page.
        time.sleep(2)
        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

page_soup = soup(driver.page_source, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
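For what it's worth, the page never scrolls because scroll_down is only defined, never called, and it is written as a method expecting a self with a .driver attribute that doesn't exist in this script. The same loop as a plain function taking the driver directly (a sketch; scroll_to_bottom is an illustrative name, and the scrollHeight heuristic is the one from the question):

```python
import time

def scroll_to_bottom(driver, pause=2.0):
    """Scroll until document.body.scrollHeight stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the next batch of companies time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we reached the bottom
        last_height = new_height

# Crucially, call it BEFORE grabbing the page source:
# scroll_to_bottom(driver)
# page_soup = soup(driver.page_source, "html.parser")
```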
Thank you for reading.

Answer: Try the approach below using Python's requests library — simple, direct, reliable, and fast, requiring less code. I got the API URL from the site itself after inspecting the Network section in Google Chrome's developer tools. Here is what the script below does:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'
    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse result using JSON loads
    extracted_data = result['fullList']['listCompanies']
    for data in extracted_data:
        print('-' * 100)
        print('Rank : ', data['rank'])
        print('Company : ', data['company'])
        print('Icon : ', data['icon'])
        print('CEO Name : ', data['ifc_ceo_name'])
        print('Facebook Address : ', data['ifc_facebook_address'])
        print('File Location : ', data['ifc_filelocation'])
        print('Linkedin Address : ', data['ifc_linkedin_address'])
        print('Twitter Handle : ', data['ifc_twitter_handle'])
        print('Secondary Link : ', data['secondary_link'])
        print('-' * 100)

scrap_inc_5000()
Comments:
You can scroll to the end of the page, for example like this — or you can use the API of the page you are trying to scrape, for example.
Thanks, I will try both. May I ask how you found that page's API?
When you open the page in your browser, you can view the network calls in the developer tools section.
Thanks! Found it — thank you very much, it worked! Although, when I was looking at a company's page on the site, I saw that there was no data for it in the API JSON, which is strange. Do you know why that happens even when the data is on the web page?