Python web scraping of IMDB does not retrieve the required columns

Tags: python, web-scraping, html-parsing, user-agent

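I tried web scraping on the IMDB website. I am looking at the top 50 horror movies and want to scrape the movie name, rating, director name, genre, and runtime.

I inspected the element for the movie name.

I inspected the elements for the rating and the director name.

I inspected the elements for the runtime and the genre.

After inspecting the elements for the title, director name, rating, runtime, and genre, I wrote this code: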

import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
#r = requests.get(my_url, headers=headers)#, proxies=proxies)
request=urllib.request.Request(my_url,None,headers)
response = urllib.request.urlopen(request)
page_html = response.read()
page_soup = BeautifulSoup(page_html,"html.parser")
page_soup.h1
page_soup.body.span
containers = page_soup.findAll("div",{"class":"lister-item mode-advanced"})
print(len(containers))

for container in containers:
  title=container.findAll("a",{"class": "lister-item-index unbold-text-primary"})
  rating = container.findAll("div",{"class":"inline-block.ratings-imdb-rating"})
  duration = container.findAll("span",{"class":"runtime"})
  genre = container.findAll("span",{"class":"genre"})
  director = container.findAll("p",{"class":"text-muted"})

print(title)
print(rating)
print(duration)
print(genre)
print(director) 
However, my code fails to retrieve these attributes.

Output:

50
[]
[]
[<span class="runtime">90 min</span>]
[<span class="genre">
Horror, Mystery, Thriller            </span>]
[<p class="text-muted ">
<span class="runtime">90 min</span>
<span class="ghost">|</span>
<span class="genre">
Horror, Mystery, Thriller            </span>
</p>, <p class="text-muted">
    A decades-old folk tale surrounding a deranged murderer killing those who celebrate Valentine's Day turns out to be true to legend when a group defies the killer's order and people start turning up dead.</p>]

It would be very helpful if someone could help me find out what I am missing.

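You are not handling your lists correctly. You need to be more specific about the tags and about how you search for the data, and change findAll to find.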

import re

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
page = requests.get(my_url, headers=headers)
page_soup = BeautifulSoup(page.text,"html.parser")
# Same container lookup as in the question; the loop below depends on it.
containers = page_soup.find_all("div", {"class": "lister-item mode-advanced"})
for container in containers:
  # find() returns a single Tag, so .text can be read directly.
  # It returns None when nothing matches, in which case .text raises AttributeError.
  print(container.find("a", href=re.compile('adv_li_tt')).text)
  print(container.find("strong").text)
  print(container.find("span",{"class":"runtime"}).text)
  print(container.find("span",{"class":"genre"}).text.strip())
  print(container.find('a', href=re.compile('adv_li_dr_0')).text)
  print('\n')
Output:

Wrong Turn
5.4
109 min
Horror, Thriller
Mike P. Nelson


Willy's Wonderland
5.7
88 min
Action, Comedy, Horror
Kevin Lewis


Red Dot
5.5
86 min
Drama, Horror, Thriller
Alain Darborg
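
To make the difference between the two approaches concrete, here is a small offline sketch. The HTML fragment is a hand-written approximation of one result block (an assumption for illustration, not markup copied from IMDB), so it runs without a network request. It suggests why the question's rating lookup came back empty: a class string containing a dot is treated as a literal class name, not as a CSS selector. It also shows that find_all returns a list while find returns a single tag.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search result; the real IMDB markup may differ.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">1.</span>
      <a href="/title/tt0000001/?ref_=adv_li_tt">Example Movie</a>
    </h3>
    <div class="ratings-bar">
      <div class="inline-block ratings-imdb-rating" data-value="5.4">
        <strong>5.4</strong>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", {"class": "lister-item mode-advanced"})

# A dot inside the class string is taken literally, so nothing matches.
wrong = container.find_all("div", {"class": "inline-block.ratings-imdb-rating"})

# Matching on one of the element's class names works.
right = container.find("div", {"class": "ratings-imdb-rating"})

# select_one() is the method that understands CSS selector syntax.
css = container.select_one("div.inline-block.ratings-imdb-rating")

# find_all() returns a list-like ResultSet; find() returns one Tag or None.
titles = container.find_all("a")
title = container.find("h3").find("a")

print(wrong)
print(right.strong.text)
print(css is not None)
print(type(titles).__name__, title.text)
```

In this fragment the title link itself carries no class, which mirrors why searching for an "a" tag by the index span's class found nothing in the question.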

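HTML is like a tree structure. You want to find the parent nodes and then walk through them to get the content inside. This site is very good practice. The director is the only tricky part, because it sits inside a <p> tag that has no attribute to distinguish it, so you need a little logic to get it. (Note that you could use a regex to find it, but since you are learning, I wanted to show you a loop.) I also attached images so you can see where I got these tags and attributes: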

import requests
from bs4 import BeautifulSoup


headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'

response = requests.get(my_url, headers=headers)
page_html = response.text
page_soup = BeautifulSoup(page_html,"html.parser")


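movies = page_soup.find_all('div', {'class': 'lister-item-content'})
for movie in movies:
    title = movie.find('h3').find('a').text
    # Note: this span holds the age certificate (e.g. 15, R), not the IMDb score.
    try:
        rating = movie.find('p').find('span', {'class': 'certificate'}).text
    except AttributeError:  # find() returned None: no certificate listed
        rating = ''
    genre = movie.find('p').find('span', {'class': 'genre'}).text.strip()
    try:
        runtime = movie.find('p').find('span', {'class': 'runtime'}).text
    except AttributeError:  # no runtime listed (e.g. unreleased titles)
        runtime = ''
    # Reset per movie so a missing director does not reuse the previous value.
    director = ''
    for p in movie.find_all('p'):
        if 'Director' in p.text:
            director = p.find('a').text

    print(title, rating, genre, runtime, director)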
Output:

Wrong Turn 18 Horror, Thriller 109 min Mike P. Nelson
Willy's Wonderland 15 Action, Comedy, Horror 88 min Kevin Lewis
Red Dot 15 Drama, Horror, Thriller 86 min Alain Darborg
Saint Maud 15 Drama, Horror, Mystery 84 min Rose Glass
Freaky 15 Comedy, Horror, Thriller 102 min Christopher Landon
Doctor Strange in the Multiverse of Madness  Action, Adventure, Fantasy  Sam Raimi
Midsommar 18 Drama, Horror, Mystery 148 min Ari Aster
Fear of Rain PG-13 Drama, Horror, Thriller 109 min Castille Landon
The Little Stranger 12A Drama, Horror, Mystery 111 min Lenny Abrahamson
Army of the Dead R Action, Crime, Horror  Zack Snyder
Get Out 15 Horror, Mystery, Thriller 104 min Jordan Peele
Synchronic 15 Drama, Horror, Sci-Fi 102 min Justin Benson
The Rental 15 Drama, Horror, Mystery 88 min Dave Franco
Shadow in the Cloud R Action, Horror, War 83 min Roseanne Liang
Don't Worry Darling  Horror, Thriller  Olivia Wilde
Venom: Let There Be Carnage  Action, Horror, Sci-Fi  Andy Serkis
The Shining 15 Drama, Horror 146 min Stanley Kubrick
The Witch 15 Drama, Horror, Mystery 92 min Robert Eggers
Split 15 Horror, Thriller 117 min M. Night Shyamalan
Hereditary 15 Drama, Horror, Mystery 127 min Ari Aster
Wrong Turn 18 Horror, Thriller 84 min Rob Schmidt
Antebellum 15 Drama, Horror, Mystery 105 min Gerard Bush
Possessor 18 Horror, Sci-Fi, Thriller 103 min Brandon Cronenberg
The New Mutants 15 Action, Horror, Sci-Fi 94 min Josh Boone
Doctor Sleep 15 Drama, Fantasy, Horror 152 min Mike Flanagan
The Invisible Man R Drama, Horror, Mystery 124 min Leigh Whannell
The Meg 12A Action, Horror, Sci-Fi 113 min Jon Turteltaub
Alien X Horror, Sci-Fi 117 min Ridley Scott
The Lighthouse 15 Drama, Fantasy, Horror 109 min Robert Eggers
Scream  Horror, Mystery, Thriller  Matt Bettinelli-Olpin
Run PG-13 Horror, Mystery, Thriller 90 min Aneesh Chaganty
Porno 18 Comedy, Horror 98 min Keola Racela
The Hunt 15 Action, Horror, Thriller 90 min Craig Zobel
Becky 18 Action, Crime, Drama 93 min Jonathan Milott
It 15 Horror 135 min Andy Muschietti
Dark Water 15 Drama, Horror, Mystery 105 min Walter Salles
A Quiet Place Part II 15 Drama, Horror, Sci-Fi 97 min John Krasinski
A Quiet Place 15 Drama, Horror, Sci-Fi 90 min John Krasinski
The Witches PG Adventure, Comedy, Family 106 min Robert Zemeckis
Resident Evil  Action, Horror, Mystery  Johannes Roberts
Us 15 Horror, Mystery, Thriller 116 min Jordan Peele
Psycho Goreman  Comedy, Horror, Sci-Fi 95 min Steven Kostanski
The Empty Man 18 Crime, Drama, Horror 137 min David Prior
From Dusk Till Dawn 18 Action, Crime, Horror 108 min Robert Rodriguez
The Platform 18 Horror, Sci-Fi, Thriller 94 min Galder Gaztelu-Urrutia
The Conjuring 3  Horror, Mystery, Thriller  Michael Chaves
Underwater 15 Action, Horror, Sci-Fi 95 min William Eubank
My Bloody Valentine 18 Horror, Mystery, Thriller 101 min Patrick Lussier
Sputnik 15 Drama, Horror, Sci-Fi 113 min Egor Abramenko
My Bloody Valentine X Horror, Mystery, Thriller 90 min George Mihalka
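
The answer mentions that a regex could replace the loop over the <p> tags. A minimal sketch of that alternative follows, run against a hand-written HTML fragment so it works offline. The fragment, the names in it, and the helper name find_director are all assumptions made for illustration; the real page markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for one result; the real IMDB markup may differ.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/?ref_=adv_li_tt">Example Movie</a></h3>
  <p class="text-muted">
    <span class="runtime">90 min</span>
    <span class="genre">Horror, Thriller</span>
  </p>
  <p class="text-muted">A short plot summary.</p>
  <p class="">
    Director:
    <a href="/name/nm0000001/?ref_=adv_li_dr_0">Jane Doe</a>
    <span class="ghost">|</span>
    Stars:
    <a href="/name/nm0000002/?ref_=adv_li_st_0">John Roe</a>
  </p>
</div>
"""


def find_director(movie):
    """Return the first director's name, or '' when none is listed."""
    # Find the text node holding the "Director:" label.
    label = movie.find(string=re.compile(r"Directors?:"))
    if label is None:
        return ""
    # The first link inside the label's parent <p> is the first director.
    link = label.parent.find("a")
    return link.text.strip() if link else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movie = soup.find("div", {"class": "lister-item-content"})
print(find_director(movie))

# A block without a credits line yields an empty string instead of an error.
empty = BeautifulSoup("<div><p>No credits here</p></div>", "html.parser")
print(repr(find_director(empty)))
```

Returning an empty string for a missing director keeps the calling loop free of try/except blocks.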


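Can you tell me how to find the correct tags? Which tag in my code is incorrect?
My mistake. I thought I had posted my code, but apparently I had not. I will post it shortly.
I updated the code. At this point you just need to build a way to capture the data as you loop over it.
Thanks for posting. It helped.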
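
One way to capture the data while looping, as the comment above suggests, is to append a dict per movie to a list and serialize it afterwards. The sketch below uses only the standard library. The field names and the helper name rows_to_csv are assumptions for illustration, and the two sample rows are taken from the output shown earlier in place of a live scrape.

```python
import csv
import io

# Column names are illustrative; "certificate" is what the loop above calls rating.
FIELDS = ["title", "certificate", "genre", "runtime", "director"]


def rows_to_csv(rows):
    """Serialize a list of per-movie dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Inside the scraping loop, append one dict per movie instead of printing.
rows = []
rows.append({
    "title": "Wrong Turn",
    "certificate": "18",
    "genre": "Horror, Thriller",
    "runtime": "109 min",
    "director": "Mike P. Nelson",
})
rows.append({
    "title": "Scream",
    "certificate": "",  # missing values stay as empty strings
    "genre": "Horror, Mystery, Thriller",
    "runtime": "",
    "director": "Matt Bettinelli-Olpin",
})

print(rows_to_csv(rows))
```

Writing to a file instead would only require passing an open file object to csv.DictWriter in place of the StringIO buffer.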