Python: Scraping Instagram IGTV data, but it only returns the first 24 records

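I am trying to fetch Instagram IGTV data such as video title, views, likes, and comments. At first I used only BeautifulSoup, but I could only get the details of the first 12 videos. Then I started using Selenium, and now I can get the details of the first 24 videos. But I need to scrape all of the videos.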

The code below gives me the hyperlinks for the first 24 videos; I will then scrape the video details from each hyperlink:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.instagram.com/agt/channel/?hl=en'
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# `chrome_options=` is deprecated; current Selenium releases expect `options=`
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # give the page time to render its first batch of videos
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')

# Build an absolute URL for every video tile found in the page source
video_links = []
for a in soup.find_all('a', class_='_bz0w', href=True):
    video_links.append('https://www.instagram.com' + a['href'])
print(video_links)

Please suggest how I can get the details of all the videos.

You probably need to scroll down to load more results. You can do that like this:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

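Combining this with an existing answer on scrolling to the bottom of a page, we can keep scrolling down until we reach the end of the page: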

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.instagram.com/agt/channel/?hl=en'
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# `chrome_options=` is deprecated; current Selenium releases expect `options=`
driver = webdriver.Chrome(options=options)
driver.get(url)
SCROLL_PAUSE_TIME = 1  # seconds; may need to be larger on a slow connection

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Only the tiles present in the DOM at this moment end up in page_source
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')

video_links = []
for a in soup.find_all('a', class_='_bz0w', href=True):
    video_links.append('https://www.instagram.com' + a['href'])
print(len(video_links))
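
A fixed one-second pause can end the loop early: if the next batch has not arrived when the height is re-read, the page looks finished. The sketch below polls the height for up to `timeout` seconds after each scroll instead of sleeping once. The function name `scroll_until_stable` and its parameters are hypothetical, chosen for illustration; it only relies on `driver.execute_script`, which the code above already uses.

```python
import time


def scroll_until_stable(driver, timeout=10.0, poll=0.5):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `timeout` and `poll` are
    illustrative defaults, not values taken from the original answer.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    scrolls = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        deadline = time.monotonic() + timeout
        new_height = last_height
        # Poll rather than sleep once, so a slow response still gets
        # up to `timeout` seconds to extend the page.
        while time.monotonic() < deadline:
            new_height = driver.execute_script(get_height)
            if new_height != last_height:
                break
            time.sleep(poll)
        if new_height == last_height:
            return scrolls  # height never changed within the timeout
        last_height = new_height
```

It would be called as `scroll_until_stable(driver)` in place of the `while True:` loop, before reading `driver.page_source`. This only makes the scrolling more patient; it does not change which links are present in the page at the end.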

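Even with this, I still get at most 24 hyperlinks. The channel has more than 100 videos.

Please see my edit. This gives me 41 links. You may need to adjust the SCROLL_PAUSE_TIME value.

This code returns around 37 results, but in no particular order. I want to get the first 37 hrefs.

You could modify the code to collect the results between scrolls.

I have already tried that, but it still doesn't work. Every run produces a different number of hrefs.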
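
The suggestion to collect results between scrolls could look roughly like the sketch below. It assumes (this is not confirmed anywhere in the thread) that the page removes off-screen tiles from the DOM as you scroll, which would explain why a single read of `page_source` at the end yields a varying subset. The function name `collect_links` and the `max_idle_rounds` parameter are hypothetical; the selector `a._bz0w` is taken from the question and may no longer match Instagram's markup.

```python
import time


def collect_links(driver, selector="a._bz0w", pause=1.0, max_idle_rounds=3):
    """Harvest link URLs after every scroll, keeping first-seen order.

    Stops once `max_idle_rounds` consecutive harvests add nothing new.
    """
    js_hrefs = (
        "return Array.from(document.querySelectorAll(arguments[0]))"
        ".map(a => a.href);"
    )
    seen = {}  # dicts keep insertion order, so links stay in page order
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        before = len(seen)
        # Read the links currently in the DOM before they can be recycled
        for href in driver.execute_script(js_hrefs, selector):
            seen.setdefault(href, None)
        idle_rounds = idle_rounds + 1 if len(seen) == before else 0
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    return list(seen)
```

With the driver from the answer, `video_links = collect_links(driver)` followed by `video_links[:37]` would give the first 37 links in the order they appeared. In the browser, `a.href` is already an absolute URL, so the `'https://www.instagram.com'` prefix is not needed here.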