Failed to scrape a dynamic web page with Selenium in Python


I am trying to scrape all 5,000 companies from this page. It is a dynamic page: more companies load as I scroll down. But I can only scrape 5 companies, so how can I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it did not work. Note: I want to scrape all of each company's information, but for now I have only selected two fields.

import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()
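# NOTE: `page` above holds the JavaScript-rendered HTML, but it is never
# parsed below; urlopen re-downloads the static HTML, which seems to contain
# only the first few companies.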

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)
Updated the code, but the page does not scroll at all. Also corrected some errors in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:

            break

        last_height = new_height
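
# NOTE: scroll_down takes `self` (it looks copied from a class method) and is
# never actually called, so the page source below is captured without any scrolling.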


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thanks for reading.

Try the approach below using Python requests: it is simple, direct, reliable, and fast, and it needs less code. I got the API URL from the website itself after inspecting the Network section of Google Chrome's developer tools.

What exactly the script below does:

  • First, it takes the API URL and performs a GET request.

  • After fetching the data, it parses the JSON response using json.loads.

  • Finally, it iterates over the list of all companies and prints each one's details, e.g. rank, company name, social account links, CEO name, and so on.

    import json
    import requests
    from urllib3.exceptions import InsecureRequestWarning

    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    def scrap_inc_5000():
        URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

        response = requests.get(URL, verify=False)
        result = json.loads(response.text)  # Parse result using json.loads
        extracted_data = result['fullList']['listCompanies']
        for data in extracted_data:
            print('-' * 100)
            print('Rank : ', data['rank'])
            print('Company : ', data['company'])
            print('Icon : ', data['icon'])
            print('CEO Name : ', data['ifc_ceo_name'])
            print('Facebook Address : ', data['ifc_facebook_address'])
            print('File Location : ', data['ifc_filelocation'])
            print('Linkedin Address : ', data['ifc_linkedin_address'])
            print('Twitter Handle : ', data['ifc_twitter_handle'])
            print('Secondary Link : ', data['secondary_link'])
            print('-' * 100)

    scrap_inc_5000()
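
If the endpoint ever returns an error page instead of JSON, json.loads raises a confusing ValueError. A slightly more defensive sketch of the same request (same endpoint URL and JSON keys as above, nothing new assumed) checks the HTTP status first:

    import requests
    from urllib3.exceptions import InsecureRequestWarning

    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

    response = requests.get(URL, verify=False)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    result = response.json()     # shorthand for json.loads(response.text)
    for data in result['fullList']['listCompanies']:
        print(data['rank'], data['company'])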
    

  • You can scroll to the end of the page, for example as sketched below these comments, or you can use the API of the page you are trying to scrape.

  • Thanks, I will try both. May I ask how you found that page's API?

  • When you open the page in your browser, you can view the network calls in the developer tools section.

  • Thanks! Found it, thank you very much. It worked! Although when I was looking at one company's page, I noticed the API JSON had no data for it, which is strange. Do you know why that happens even though the data is on the web page?
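
For the scrolling suggestion in the first comment, here is a minimal sketch that reuses the driver setup and selectors from the question (the sc-… class names come from the question's snippets and may have changed since):

    import time
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver

    my_url = 'https://www.inc.com/profile/onetrust'

    driver = webdriver.Chrome()
    driver.get(my_url)

    def scroll_down(driver):
        """Keep scrolling to the bottom until the page height stops growing."""
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # give the next batch of companies time to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

    scroll_down(driver)  # the question's version defined this but never called it
    page_soup = soup(driver.page_source, "html.parser")
    driver.quit()

    for container in page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile"):
        rank = container.h2.get_text()
        company_name = container.find_all("h2", class_="sc-AxgMl LXebc h2")[0].get_text()
        print("rank :" + rank)
        print("Company_name :" + company_name)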