Python: Need to understand why BeautifulSoup cannot query elements by class

python web-scraping

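For this simple BeautifulSoup experiment, I am trying to get some simple data from the IMDb page https://www.imdb.com/title/tt7069210/

The problem is that I cannot get the elements with the class rec_item. I have tried many selectors to fetch them, but every time the result is an empty list.

Here is why I find this strange:

The elements with the class rec_item are not inside any iframe. They are visible when viewing the page source in the browser, so, as I understand it, they are not being loaded by JavaScript after the page loads.

Question: Can someone help me understand why the list of rec_item elements is empty?

Additional information

Here is the code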

from bs4 import BeautifulSoup
import requests


def extract(url):
    res = requests.get(url)
    bsoup = BeautifulSoup(res.text, 'html.parser')
    the_title = bsoup.select('meta[name="title"]')[0].attrs['content']
    print('Title: ' + the_title)    # This works fine

    long_text = bsoup.select('#titleStoryLine .inline.canwrap span')[0].string.strip()
    print('Description: ' + long_text)    # this too works fine

    similar_movies = bsoup.select('.rec_item')
    print(similar_movies)   # blank array :(


extract('https://www.imdb.com/title/tt7069210/')

The browser's View Page Source

This is the output from repl.it

You have to add headers to get the proper HTML and not the hypertext wanted by some third-rate robot.

Here's how to do it:

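import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36"
}


def extract(url):
    # Send a browser-like User-Agent so the server returns the full page
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    the_title = soup.select('meta[name="title"]')[0].attrs['content']
    print('Title: ' + the_title)    # This works fine

    long_text = soup.select('#titleStoryLine .inline.canwrap span')[0].string.strip()
    print('Description: ' + long_text)    # this too works fine

    similar_movies = soup.select('.rec_item img')
    print([i["title"] for i in similar_movies])    # Now works :)


extract('https://www.imdb.com/title/tt7069210/')

Output: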

Title: The Conjuring 3: The Devil Made Me Do It (2021) - IMDb
Description: A chilling story of terror, murder and unknown evil that shocked even experienced real-life paranormal investigators Ed and Lorraine Warren. One of the most sensational cases from their files, it starts with a fight for the soul of a young boy, then takes them beyond anything they'd ever seen before, to mark the first time in U.S. history that a murder suspect would claim demonic possession as a defense.
['The Conjuring 2', 'The Conjuring 2 Remake', 'The Conjuring', 'The Maiden', 'Conjuring the Devil', 'Billie Eilish: Bury a Friend', 'Oxygen', 'The Curse of La Llorona', 'Annabelle Comes Home', 'Shang-Chi and the Legend of the Ten Rings', 'Malignant', 'The Nun']
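
One way to check this kind of diagnosis is to count how many elements carry the class in the HTML the server actually returned, once with the default client headers and once with a browser-like User-Agent. The following is only a sketch and not part of the original answer: it uses just the standard library so it runs without extra dependencies, and the names ClassCounter, count_class, and fetch are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the HTML string have the given class."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


def fetch(url, user_agent=None):
    """Download a page, optionally sending a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    with urlopen(Request(url, headers=headers), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling count_class(fetch(url), "rec_item") and then the same with a browser-like user_agent should show whether the two requests received different markup. If both counts are zero, the elements are more likely added later by JavaScript. The same comparison can be made with requests and BeautifulSoup by comparing len(soup.select('.rec_item')) for the two responses.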

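The recommendations are loaded dynamically with JS; they are not in the body you downloaded. You can't do this with requests; try Selenium.

@Lucas, but they are available in View Page Source, and that is what makes this mysterious to me. As the accepted answer under this SO link says: "View Source in the browser shows you the raw HTML source of the page, which is exactly what was received from the server before any client-side modification. So it will not include any dynamic changes made to the page by JavaScript."

"Not the hypertext wanted by some third-rate robot" :D Thank you very much, sir :). So the server is stripping out some of the HTML because it senses that the request is not coming from a real browser? Could you point me to a document or URL where I can learn more about this behavior? That would be a great help. Thanks again, man...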
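
To illustrate the behavior asked about in the last comment, here is a toy local server that returns different HTML depending on the User-Agent header. This is a hypothetical sketch and says nothing about how IMDb actually decides what to serve: the handler name, the page contents, and the "starts with Mozilla/" rule are all assumptions made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Hypothetical page bodies: one with the recommendations, one without
FULL_PAGE = b'<html><body><div class="rec_item">A</div></body></html>'
REDUCED_PAGE = b"<html><body></body></html>"


class UserAgentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumed rule: treat anything that claims to be Mozilla as a browser
        agent = self.headers.get("User-Agent", "")
        body = FULL_PAGE if agent.startswith("Mozilla/") else REDUCED_PAGE
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url, user_agent=None):
    """Download a page, optionally sending a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    with urlopen(Request(url, headers=headers), timeout=5) as resp:
        return resp.read().decode("utf-8")


def demo():
    """Fetch the same URL twice and return (default_client_html, browser_html)."""
    server = HTTPServer(("127.0.0.1", 0), UserAgentHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        return fetch(url), fetch(url, "Mozilla/5.0 (hypothetical browser)")
    finally:
        server.shutdown()
        server.server_close()
```

Both requests in demo() hit the same URL, yet only the one with the browser-like User-Agent receives the rec_item element. This matches the symptom in the question, where the browser's View Page Source showed markup that the script's default request did not get.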