Python: Problem getting links from a YouTube channel with BeautifulSoup
Tags: python, python-3.x, beautifulsoup, youtube, python-requests

I am trying to scrape a YouTube channel and return the links to every video on that channel, but when I print the links I only get a handful that have nothing to do with the videos. I suspect the videos are loaded through JavaScript, so is there any way to do this with BeautifulSoup? Do I have to use Selenium? Could someone help me test this? Here is my code so far:
import requests
from bs4 import BeautifulSoup

print('scanning page...')
youtuber = 'memeulous'
result = requests.get('https://www.youtube.com/c/' + youtuber + '/videos')
status = result.status_code
src = result.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all('a')
if status == 200:
    print('valid URL, grabbing uploads...')
else:
    print('invalid URL, status code: ' + str(status))
    quit()
print(links)
Here is my output:
scanning page...
valid URL, grabbing uploads...
[<a href="https://www.youtube.com/about/" slot="guide-links-primary" style="display: none;">About</a>, <a href="https://www.youtube.com/about/press/" slot="guide-links-primary" style="display: none;">Press</a>, <a href="https://www.youtube.com/about/copyright/" slot="guide-links-primary" style="display: none;">Copyright</a>, <a href="/t/contact_us" slot="guide-links-primary" style="display: none;">Contact us</a>, <a href="https://www.youtube.com/creators/" slot="guide-links-primary" style="display: none;">Creators</a>, <a href="https://www.youtube.com/ads/" slot="guide-links-primary" style="display: none;">Advertise</a>, <a href="https://developers.google.com/youtube" slot="guide-links-primary" style="display: none;">Developers</a>, <a href="/t/terms" slot="guide-links-secondary" style="display: none;">Terms</a>, <a href="https://www.google.co.uk/intl/en-GB/policies/privacy/" slot="guide-links-secondary" style="display: none;">Privacy</a>, <a href="https://www.youtube.com/about/policies/" slot="guide-links-secondary" style="display: none;">Policy and Safety</a>, <a href="https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen" slot="guide-links-secondary" style="display: none;">How YouTube works</a>, <a href="/new" slot="guide-links-secondary" style="display: none;">Test new features</a>]
[Finished in 4.0s]
As you can see, there are no video links.

One way to do this is to call the YouTube Data API with the following code:
import requests

api_key = "PASTE_YOUR_API_KEY_HERE!"
yt_user = "memeulous"
api_url = f"https://www.googleapis.com/youtube/v3/channels?part=contentDetails&forUsername={yt_user}&key={api_key}"

response = requests.get(api_url).json()
playlist_id = response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

channel_url = f"https://www.googleapis.com/youtube/v3/playlistItems?" \
              f"part=snippet%2CcontentDetails&maxResults=50&playlistId={playlist_id}&key={api_key}"


def get_video_ids(vid_data: dict) -> list:
    return [_id["contentDetails"]["videoId"] for _id in vid_data["items"]]


def build_links(vid_ids: list) -> list:
    return [f"https://www.youtube.com/watch?v={_id}" for _id in vid_ids]


def get_all_links() -> list:
    all_links = []
    url = channel_url
    while True:
        res = requests.get(url).json()
        all_links.extend(build_links(get_video_ids(res)))
        try:
            paging_token = res["nextPageToken"]
            url = f"{channel_url}&pageToken={paging_token}"
        except KeyError:
            break
    return all_links


print(get_all_links())
This fetches the links to all of the videos (469) for the user memeulous:
['https://www.youtube.com/watch?v=4L8_isnyGfg', 'https://www.youtube.com/watch?v=ogpaiD2e-ss', 'https://www.youtube.com/watch?v=oH-nJe9XMN0', 'https://www.youtube.com/watch?v=kUcbKl4qe5g', ...
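The paging loop in get_all_links can also be tried without an API key. The following is a minimal sketch, not the answer's own code: collect_video_links and fetch_page are hypothetical names introduced here, and the canned pages and video IDs are made up. It follows nextPageToken until the response no longer contains one, which is the same stopping rule as the KeyError branch above.

```python
from typing import Callable, Optional


def collect_video_links(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Follow nextPageToken until a response arrives without one.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns one parsed playlistItems response.
    """
    links = []
    token = None
    while True:
        page = fetch_page(token)
        for item in page.get("items", []):
            video_id = item["contentDetails"]["videoId"]
            links.append(f"https://www.youtube.com/watch?v={video_id}")
        token = page.get("nextPageToken")
        if token is None:
            break
    return links


# Canned responses standing in for the real API (made-up video IDs).
FAKE_PAGES = {
    None: {
        "items": [
            {"contentDetails": {"videoId": "aaa"}},
            {"contentDetails": {"videoId": "bbb"}},
        ],
        "nextPageToken": "PAGE2",
    },
    "PAGE2": {"items": [{"contentDetails": {"videoId": "ccc"}}]},
}


def fake_fetch(token: Optional[str]) -> dict:
    return FAKE_PAGES[token]


demo_links = collect_video_links(fake_fetch)
print(demo_links)
```

Against the real API, fetch_page would wrap the requests.get call from the code above and add the pageToken parameter whenever the token is not None.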
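You can get the total video count from the videos_data object, like this (videos_data is not defined in the code above; presumably it is the parsed JSON of a playlistItems response, such as requests.get(channel_url).json()):
print(f"Total videos: {videos_data['pageInfo']['totalResults']}")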
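One caveat about the channel lookup at the top of the script: response["items"][0] raises an exception when the lookup matches no channel. The sketch below is a hedged alternative, assuming that in this case the items list is missing or empty. uploads_playlist_id is a hypothetical helper name, and both sample responses are invented for illustration.

```python
from typing import Optional


def uploads_playlist_id(channel_response: dict) -> Optional[str]:
    """Return the uploads playlist ID, or None if no channel was matched."""
    # Assumption: a lookup with no match has a missing or empty "items" list.
    items = channel_response.get("items") or []
    if not items:
        return None
    return items[0]["contentDetails"]["relatedPlaylists"]["uploads"]


# Made-up responses for illustration only.
found = {
    "items": [
        {"contentDetails": {"relatedPlaylists": {"uploads": "UU_example"}}}
    ]
}
not_found = {"pageInfo": {"totalResults": 0}}

print(uploads_playlist_id(found))
print(uploads_playlist_id(not_found))
```

Returning None lets the caller print a clear message instead of failing with a KeyError or IndexError.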
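I hope this helps and gets you started. All you need to do is get an API key for the YouTube Data API.

You don't have to use Selenium, but you will probably need to use YouTube's API. You get nothing because the site is rendered dynamically by JS.
@baduker Is there any way to do this with BeautifulSoup? I don't know how to use the API or how difficult it is.
It's not as hard as you think. There are plenty of Python wrappers for the YouTube API. And no, you can't get what you want with bs4 alone. Here's a quick start -
Thanks, I'll check it out.
How do I make it show all the videos and not just 50?
Then you'd use something called a paging token. I've updated the answer so it fetches all the videos.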
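To illustrate why the scraper in the question finds no video links, here is a small sketch under stated assumptions. The page markup is invented, and the standard library's html.parser is used instead of BeautifulSoup to keep the example dependency-free. BeautifulSoup behaves the same way in this respect, since it only parses the markup it is handed and never runs scripts.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# A toy page: one static link, plus a script that would add a video link
# when run in a browser. The markup is invented for this example.
PAGE = """
<html><body>
  <a href="/about">About</a>
  <div id="videos"></div>
  <script>
    var a = document.createElement("a");
    a.href = "/watch?v=abc123";
    document.getElementById("videos").appendChild(a);
  </script>
</body></html>
"""

collector = AnchorCollector()
collector.feed(PAGE)
print(collector.hrefs)
```

Only the static /about link is collected. The /watch link would exist only after a browser executed the script, which is why a plain requests plus parser approach sees footer-style links but no videos.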