Python BeautifulSoup – accessing more reviews
I'm trying to scrape the reviews from an IMDB movie link and extract each reviewer's username, but I only get 25 usernames, because that's all the page shows until you press "Load More". I need a way to access all the reviews. Is there any way to do this other than Selenium? For some reason I get an SSL certificate error when I try to import it.

Tags: python, selenium, beautifulsoup, requests
import requests
import bs4

url = 'https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv'
response = requests.get(url, verify=False)

soup = bs4.BeautifulSoup(response.content, 'html5lib')
name = soup.find_all('span', class_='display-name-link')
len(name)  # 25 – only the first page of reviews
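As an aside on the SSL error: `verify=False` in requests skips certificate verification (and emits an `InsecureRequestWarning`). If the import trouble pushes you toward the stdlib instead, the equivalent there is an unverified SSL context. This is a sketch only; disabling verification is insecure and should be a last resort:

```python
import ssl
import urllib.request

# Build a context that skips certificate verification (the stdlib
# equivalent of requests' verify=False). Insecure: last resort only.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# urllib.request.urlopen(url, context=ctx) would then ignore bad certificates.
```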
I can't think of a way to click the element without Selenium. You can add an option so the browser ignores SSL certificate errors:

Firefox
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Chrome
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
IE
from selenium import webdriver
capabilities = webdriver.DesiredCapabilities().INTERNETEXPLORER
capabilities['acceptSslCerts'] = True
driver = webdriver.Ie(capabilities=capabilities)
driver.get('https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv')
driver.close()
Here's what you're looking for:
import requests as r
from time import sleep
from bs4 import BeautifulSoup
# We'll use this link that gets only the reviews
url_reviews = 'https://www.imdb.com/title/tt0068646/reviews/_ajax'
# We need to get a key every time to scrape the next page
url_reviews_next = 'https://www.imdb.com/title/tt0068646/reviews/_ajax?paginationKey='
response = r.get(url_reviews)
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.find_all('span', class_='display-name-link')
# Get your data here
paginationKey = ''
# If there's only one page, the load-more-data element won't be present
try:
    paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
except IndexError:
    paginationKey = ''
print(paginationKey)
while paginationKey != '':
    response = r.get(url_reviews_next + paginationKey)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Get your data here
    try:
        paginationKey = soup.find_all('div', class_='load-more-data')[0]['data-key']
    except IndexError:
        paginationKey = ''
    print(paginationKey)
# When the page has no more pagination, the lookup raises and the loop ends
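For reference, the pagination key lives in a `data-key` attribute on the load-more element, e.g. `<div class="load-more-data" data-key="...">`. Extracting it doesn't strictly need BeautifulSoup; here is a stdlib-only sketch (the sample markup and key below are made up):

```python
from html.parser import HTMLParser

class PaginationKeyFinder(HTMLParser):
    """Grab data-key from the first <div class="load-more-data"> seen."""
    def __init__(self):
        super().__init__()
        self.key = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'div' and not self.key
                and 'load-more-data' in (attrs.get('class') or '')):
            self.key = attrs.get('data-key', '')

finder = PaginationKeyFinder()
finder.feed('<div class="load-more-data" data-key="abc123"></div>')
print(finder.key)  # abc123
```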
Each time, we extract the pagination key to scrape the next page; no Selenium needed.
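Both non-Selenium approaches share the same loop: fetch a page, collect data, follow the key until it's empty. That shape can be factored into a small helper (a sketch; `get_page` is a hypothetical callback you'd implement with requests + BeautifulSoup, returning a page's items plus the next key, or `''` on the last page):

```python
import time

def fetch_all_pages(get_page, first_key='', delay=0.0):
    """Collect items from every page by following pagination keys.

    get_page(key) must return (items, next_key); an empty next_key
    signals the final page.
    """
    items, key = [], first_key
    while True:
        page_items, key = get_page(key)
        items.extend(page_items)
        if not key:
            return items
        time.sleep(delay)  # be polite between requests

# Usage with a fake two-page source, just to show the control flow:
fake_site = {'': (['CalRhys'], 'key1'), 'key1': (['SJ_1'], '')}
print(fetch_all_pages(lambda k: fake_site[k]))  # ['CalRhys', 'SJ_1']
```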
To scrape all the usernames (4041 in total) without Selenium, send GET requests that simulate clicking the "Load More" button:
import requests
from bs4 import BeautifulSoup
main_url = "https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv"
ajax_url = "https://www.imdb.com/title/tt0068646/reviews/_ajax?ref_=undefined&paginationKey={}"
soup = BeautifulSoup(requests.get(main_url).content, "html5lib")
while True:
    for tag in soup.select(".display-name-link"):
        print(tag.text)
    print("-" * 30)
    button = soup.select_one(".load-more-data")
    if not button:
        break
    key = button["data-key"]
    soup = BeautifulSoup(requests.get(ajax_url.format(key)).content, "html5lib")
Output:
CalRhys
gogoschka-1
SJ_1
andrewburgereviews
alexkolokotronis
MR_Heraclius
b-a-h TNT-6
danielfeerst
mattrochman
Godz365
winnantonio
Trevizolga
DaveDiggler
ks4
...
... All the way until
Steven Bray
Castor-5
BLDJ
pinky67
dean keaton
rejoefrankel
Timothy