Python BeautifulSoup – accessing more reviews
I'm trying to scrape the reviews from an IMDB movie link and extract each reviewer's username, but I only get 25 usernames, because that's all the page shows until you press "Load More". I need a way to access all the reviews. Is there any way to do this other than Selenium? For some reason I get an SSL certificate error when I try to import it.

Tags: python, selenium, beautifulsoup, requests
import requests
import bs4

url = 'https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv'
response = requests.get(url, verify=False)

soup = bs4.BeautifulSoup(response.content, 'html5lib')
name = soup.find_all('span', class_='display-name-link')
len(name)  # 25 – only the first page of reviews
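As an aside on the SSL error: `verify=False` in requests skips certificate verification (and emits an `InsecureRequestWarning`). If the import trouble pushes you toward the stdlib instead, the equivalent there is an unverified SSL context. This is a sketch only; disabling verification is insecure and should be a last resort:

```python
import ssl
import urllib.request

# Build a context that skips certificate verification (the stdlib
# equivalent of requests' verify=False). Insecure: last resort only.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# urllib.request.urlopen(url, context=ctx) would then ignore bad certificates.
```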
I can't think of a way to click the element without Selenium. You can add an option so the browser ignores SSL certificate errors:

Firefox
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Chrome
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
IE
from selenium import webdriver
capabilities = webdriver.DesiredCapabilities().INTERNETEXPLORER
capabilities['acceptSslCerts'] = True
driver = webdriver.Ie(capabilities=capabilities)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Here's what you're looking for:
import requests as r
from time import sleep
from bs4 import BeautifulSoup
# We'll use this link that gets only the reviews
url_reviews = 'https://www.imdb.com/title/tt0068646/reviews/_ajax'
# We need to get a key every time to scrape the next page
url_reviews_next = 'https://www.imdb.com/title/tt0068646/reviews/_ajax?paginationKey='
response = r.get(url_reviews)
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.find_all('span', class_='display-name-link')
# Get your data here
paginationKey = ''
# If there's only one page, the load-more-data element won't be present
try:
    paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
except IndexError:
    paginationKey = ''
print(paginationKey)
while paginationKey != '':
    response = r.get(url_reviews_next + paginationKey)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Get your data here
    try:
        paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
    except IndexError:
        paginationKey = ''
    print(paginationKey)
# When the page has no more pagination, the lookup raises and the loop ends
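For reference, the pagination key lives in a `data-key` attribute on the load-more element, e.g. `<div class="load-more-data" data-key="...">`. Extracting it doesn't strictly need BeautifulSoup; here is a stdlib-only sketch (the sample markup and key below are made up):

```python
from html.parser import HTMLParser

class PaginationKeyFinder(HTMLParser):
    """Grab data-key from the first <div class="load-more-data"> seen."""
    def __init__(self):
        super().__init__()
        self.key = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'div' and not self.key
                and 'load-more-data' in (attrs.get('class') or '')):
            self.key = attrs.get('data-key', '')

finder = PaginationKeyFinder()
finder.feed('<div class="load-more-data" data-key="abc123"></div>')
print(finder.key)  # abc123
```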
Each time, we extract the pagination key to scrape the next page; no Selenium needed.
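Both non-Selenium approaches share the same loop: fetch a page, collect data, follow the key until it's empty. That shape can be factored into a small helper (a sketch; `get_page` is a hypothetical callback you'd implement with requests + BeautifulSoup, returning a page's items plus the next key, or `''` on the last page):

```python
import time

def fetch_all_pages(get_page, first_key='', delay=0.0):
    """Collect items from every page by following pagination keys.

    get_page(key) must return (items, next_key); an empty next_key
    signals the final page.
    """
    items, key = [], first_key
    while True:
        page_items, key = get_page(key)
        items.extend(page_items)
        if not key:
            return items
        time.sleep(delay)  # be polite between requests

# Usage with a fake two-page source, just to show the control flow:
fake_site = {'': (['CalRhys'], 'key1'), 'key1': (['SJ_1'], '')}
print(fetch_all_pages(lambda k: fake_site[k]))  # ['CalRhys', 'SJ_1']
```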
To scrape all the usernames (4041 in total) without Selenium, send GET requests that simulate clicking the "Load More" button:
import requests
from bs4 import BeautifulSoup
main_url = "https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv"
ajax_url = "https://www.imdb.com/title/tt0068646/reviews/_ajax?ref_=undefined&paginationKey={}"
soup = BeautifulSoup(requests.get(main_url).content, "html5lib")
while True:
    for tag in soup.select(".display-name-link"):
        print(tag.text)
    print("-" * 30)
    button = soup.select_one(".load-more-data")
    if not button:
        break
    key = button["data-key"]
    soup = BeautifulSoup(requests.get(ajax_url.format(key)).content, "html5lib")
Output:
CalRhys
gogoschka-1
SJ_1
andrewburgereviews
alexkolokotronis
MR_Heraclius
b-a-h TNT-6
danielfeerst
mattrochman
Godz365
winnantonio
Trevizolga
DaveDiggler
ks4
...
... All the way until
Steven Bray
Castor-5
BLDJ
pinky67
dean keaton
rejoefrankel
Timothy