Scraping a dynamic web page with Selenium in Python fails
I am trying to scrape all 5,000 companies from this page. The page is dynamic: more companies load as I scroll down, but I can only scrape 5 of them. How do I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it didn't work.
Note: I want to scrape all of each company's information, but for now I have selected only two fields.
import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
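One likely reason the block above only ever sees 5 companies: after rendering the page with Selenium and saving it into page, the code re-fetches the URL with urlopen and parses that raw server response, which contains none of the dynamically loaded content. A minimal sketch of parsing the already-rendered source instead (extract_companies is an illustrative helper name; the class names are the ones from the question and may change on the live site):

```python
from bs4 import BeautifulSoup

def extract_companies(html):
    """Parse rendered HTML and return (rank, company name) pairs."""
    page_soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile"):
        # First <h2> holds the rank; the class-tagged <h2> holds the name.
        rank = container.h2.get_text()
        name = container.find_all("h2", class_="sc-AxgMl LXebc h2")[0].get_text()
        results.append((rank, name))
    return results

# Feed it the Selenium-rendered source, not the urlopen response:
# for rank, name in extract_companies(page):
#     print("rank :" + rank)
#     print("Company_name :" + name)
```

This only removes the redundant urlopen round trip; it does not by itself load more companies — the page still has to be scrolled before page_source is captured.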
I updated the code, but the page does not scroll at all. I also corrected some mistakes in the BeautifulSoup code.
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)

def scroll_down(self):
    """A method for scrolling the page."""
    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load the page.
        time.sleep(2)
        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

page_soup = soup(driver.page_source, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
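For what it's worth, the page never scrolls because scroll_down is only defined, never called, and it is written as a method expecting a self with a .driver attribute that doesn't exist in this script. The same loop as a plain function taking the driver directly (a sketch; scroll_to_bottom is an illustrative name, and the scrollHeight heuristic is the one from the question):

```python
import time

def scroll_to_bottom(driver, pause=2.0):
    """Scroll until document.body.scrollHeight stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the next batch of companies time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we reached the bottom
        last_height = new_height

# Crucially, call it BEFORE grabbing the page source:
# scroll_to_bottom(driver)
# page_soup = soup(driver.page_source, "html.parser")
```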
Thank you for reading.

Answer: Try the approach below using Python's requests library — simple, direct, reliable, and fast, requiring less code. I got the API URL from the site itself after inspecting the Network section in Google Chrome's developer tools. Here is what the script below does:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'
    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse result using JSON loads
    extracted_data = result['fullList']['listCompanies']
    for data in extracted_data:
        print('-' * 100)
        print('Rank : ', data['rank'])
        print('Company : ', data['company'])
        print('Icon : ', data['icon'])
        print('CEO Name : ', data['ifc_ceo_name'])
        print('Facebook Address : ', data['ifc_facebook_address'])
        print('File Location : ', data['ifc_filelocation'])
        print('Linkedin Address : ', data['ifc_linkedin_address'])
        print('Twitter Handle : ', data['ifc_twitter_handle'])
        print('Secondary Link : ', data['secondary_link'])
        print('-' * 100)

scrap_inc_5000()
Comments:
You can scroll to the end of the page, for example like this — or you can use the API of the page you are trying to scrape, for example.
Thanks, I will try both. May I ask how you found that page's API?
When you open the page in your browser, you can view the network calls in the developer tools section.
Thanks! Found it — thank you very much, it worked! Although, when I was looking at a company's page on the site, I saw that there was no data for it in the API JSON, which is strange. Do you know why that happens even when the data is on the web page?