Web scraping with Beautiful Soup in Python

I want to scrape the YouTube homepage and pull out the links to all the videos. Here is the code:

from bs4 import BeautifulSoup
import requests

s='https://www.youtube.com/'
html=requests.get(s)
html=html.text

s=BeautifulSoup(html,features="html.parser")

for e in s.find_all('a',{'id':'video-title'}):
    link=e.get('href')
    text=e.string
    print(text)
    print(link)
    print()

When I run the code above, nothing is printed. It seems the id is never found. What am I doing wrong?

This is because you are not getting the same HTML that your browser gets.

import requests
from bs4 import BeautifulSoup


s =  requests.get("https://youtube.com").text

soup = BeautifulSoup(s,'lxml')

print(soup)
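Save the output of this code to a file named test.html and open it in your browser. You will see that it differs from what the browser normally shows: the page looks broken, because the content that the browser builds with JavaScript is missing.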
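The saving step can be sketched as follows. This is only a minimal illustration: the helper names `save_html` and `count_video_titles` and the sample markup are assumptions made for the example, and in a real run you would pass in `requests.get("https://youtube.com").text` instead of the sample string.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def save_html(html, path="test.html"):
    """Parse html and write the pretty-printed markup to path."""
    soup = BeautifulSoup(html, "html.parser")
    out = Path(path)
    out.write_text(soup.prettify(), encoding="utf-8")
    return out


def count_video_titles(html):
    """Count <a id="video-title"> tags present in the given markup."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", {"id": "video-title"}))


# Hypothetical stand-in for a response body: the titles are meant to be
# filled in by a script, so the static markup holds no matching <a> tag.
sample = "<html><body><div id='content'><script>render()</script></div></body></html>"

saved = save_html(sample)
print("wrote", saved)
print("video-title anchors found:", count_video_titles(sample))
```

If the count is zero for the HTML that requests returns, the loop in the question has nothing to iterate over, which would explain the empty output.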

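See also the following question.

Basically, I suggest you use Selenium WebDriver, since it drives a real browser and therefore sees the page the way a browser does.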

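Yes, this is an odd one to scrape, but if you scrape at the div id="content" level you can get the data you asked for. I was able to get the title of each video, but YouTube seems to apply some kind of rate limiting or throttling, so I don't think you can get all the titles and links. Anyway, here is what I did for the titles: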

import requests
from bs4 import BeautifulSoup

url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')

for each in links:
    print(each.text)

This might help for scraping all the videos from the YouTube homepage:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

r = 'https://www.youtube.com/'
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
    if i.has_attr('href'):
        text = i.attrs.get('href')
        if text.startswith('/watch?'):
            # urljoin avoids the double slash that r + text would produce
            urls = urljoin(r, text)
            all_videos.append(urls)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)
This snippet selects every link on the youtube.com homepage whose href attribute contains /watch? (the video links).

It prints:

https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU

...and so on
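
A minimal sketch of that selection, assuming a CSS attribute selector is used to match the href. The function name `watch_links` and the sample markup are made up for illustration, and unlike the output above this version drops repeated links. In practice you would pass in the text of `requests.get('https://www.youtube.com/')`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.youtube.com/"


def watch_links(html, base=BASE):
    """Return absolute URLs of <a> tags whose href contains '/watch?'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # [href*="..."] matches when the attribute value contains the substring.
    for a in soup.select('a[href*="/watch?"]'):
        url = urljoin(base, a["href"])
        if url not in links:  # keep first occurrence, skip repeats
            links.append(url)
    return links


# Hypothetical markup: each video is linked twice (thumbnail and title),
# which is one plausible reason for the doubled lines in the output above.
sample = (
    '<a href="/watch?v=pBhkG2Zwf-c">thumbnail</a>'
    '<a href="/watch?v=pBhkG2Zwf-c">title</a>'
    '<a href="/feed/trending">Trending</a>'
    '<a href="/watch?v=gnn7GwqXek4">title</a>'
)

for link in watch_links(sample):
    print(link)
```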

Don't we have to pass the id as a dictionary, i.e. {'id':'content'}?
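
Not necessarily: Beautiful Soup accepts the id either as a keyword argument or inside an attrs dictionary, and both forms select the same elements. A small sketch with made-up markup (the two div tags are an assumption for the example, not YouTube's real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to compare the lookup styles.
html = '<div id="content">first</div><div id="other">second</div>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", id="content")           # keyword argument
by_dict = soup.find_all("div", {"id": "content"})         # positional dict
by_attrs = soup.find_all("div", attrs={"id": "content"})  # explicit attrs=

# All three lookups return the same single tag.
print([d.text for d in by_keyword])
print(by_keyword == by_dict == by_attrs)
```

The dictionary form is mainly useful for attribute names that cannot be written as Python keyword arguments, such as data-* attributes.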