Scraping web page content after the ::before marker with Python

python python-3.x web-scraping

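[Sample page markup][1]

I am using BeautifulSoup and Python to extract article titles from the following website: [Knapply][2].

When I load the HTML with BS4, everything that comes after

<ul id="articleview" class="clearfix" style="padding-left: 0">
::before

(that is, after the ::before marker) is not imported. As the sample shows, the

<li class="col-sm-6 col-md-4 col-lg-4">

tags that follow contain the details of each article, but I cannot find these tags with find/find_all.

Any suggestions would be helpful.

  [1]:
  [2]: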

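The page is loaded dynamically with JavaScript, so the article tags are not in the HTML that BeautifulSoup receives. You have to use Selenium: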

import time

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://knappily.com/')

time.sleep(3)  # give the initial page time to render

st = time.time()

# Keeps loading more articles for 2 minutes. Increase the limit if you want more.
while time.time() - st <= 120:
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="buttons"]'))
        ).click()
        time.sleep(1)
    except WebDriverException:
        # The button did not show up within 10 s or could not be clicked: stop loading.
        break

# find_element_by_* was removed in Selenium 4; use find_element(By.<strategy>, value).
main_art = driver.find_element(By.CLASS_NAME, 'impr-content').find_element(By.XPATH, './/h1').text

print(f"{'-'*80}\nMain Article:\n{'-'*80}\n{main_art}\n{'-'*80}\nOther Articles:")

articles = driver.find_elements(By.CLASS_NAME, 'article-content')

for article in articles:
    print('-'*80)
    print(article.find_element(By.XPATH, './/h3').text)
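
For readers who would rather keep using BeautifulSoup: ::before is a CSS pseudo-element that the browser's developer tools display. It is not markup in the HTML, so the parser has nothing to stop at. Once a browser has rendered the page, the resulting HTML (for example Selenium's driver.page_source) can be handed to BeautifulSoup and searched as usual. The following is a minimal sketch, assuming the article-content class and h3 headings used in the answer above. The helper name titles_from_html and the sample markup are made up for illustration and stand in for the real page source.

```python
from bs4 import BeautifulSoup


def titles_from_html(html):
    """Return the text of every <h3> nested inside an element with class 'article-content'."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.select(".article-content h3")]


# Hypothetical stand-in for driver.page_source; the real markup may differ.
sample_html = """
<ul id="articleview" class="clearfix" style="padding-left: 0">
  <li class="col-sm-6 col-md-4 col-lg-4">
    <div class="article-content"><h3>First title</h3></div>
  </li>
  <li class="col-sm-6 col-md-4 col-lg-4">
    <div class="article-content"><h3>Second title</h3></div>
  </li>
</ul>
"""

print(titles_from_html(sample_html))
```

With a live Selenium session, calling titles_from_html(driver.page_source) after the loading loop has finished should return the same titles the answer prints, provided the page really uses those class names.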

Thank you very much. I am new to Selenium and tried some code, but the results all came back empty. By the way, how can I use the Selenium driver to press the load more button and extract the content of the subsequent pages? Thanks again.

Would you like me to help you load more articles by clicking the "load more" button? By the way, thank you for accepting my answer as the best answer!

Check my latest edit. I added how to keep scrolling down to load more content.

>>> len(articles)
714
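
The loop in the answer stops after a fixed two minutes. A possible variation, sketched here as an assumption and not as the answerer's method, is to stop as soon as a load attempt no longer adds articles. The function below takes two callables so it can be tried without a browser. The names load_until_stable and FakePage are hypothetical.

```python
import time


def load_until_stable(count_items, load_more, max_seconds=120, pause=1.0):
    """Call load_more() until count_items() stops growing or max_seconds elapse.

    Returns the last observed item count.
    """
    deadline = time.monotonic() + max_seconds
    previous = count_items()
    while time.monotonic() < deadline:
        load_more()
        time.sleep(pause)  # give the page time to append new items
        current = count_items()
        if current <= previous:
            # Nothing new appeared, so assume the end of the list was reached.
            return current
        previous = current
    return previous


class FakePage:
    """Hypothetical in-memory page that reveals `step` more items per load."""

    def __init__(self, total, step):
        self.total = total
        self.step = step
        self.loaded = min(total, step)

    def count(self):
        return self.loaded

    def load_more(self):
        self.loaded = min(self.total, self.loaded + self.step)


page = FakePage(total=25, step=10)
final_count = load_until_stable(page.count, page.load_more, max_seconds=5, pause=0)
print(final_count)
```

With Selenium, count_items could be a lambda returning len(driver.find_elements(By.CLASS_NAME, 'article-content')), and load_more could click the button or run driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"). A slow network can make one attempt look like the end of the list, so a larger pause may be needed.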