Python: scraping content hidden behind a "Read More" button

python web-scraping


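I am trying to scrape reviews with BeautifulSoup in Python.

The review content is truncated behind a "Read More..." button. How can I trigger that button to get the full content?

I noticed that clicking the button fires an XHR request. How can I simulate that request in Python?

Inspecting the "Read More..." button gives the following markup: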

<a style="cursor:pointer" onclick="bindreviewcontent('2836986',this,false,'I found this review of ICICI Lombard Auto Insurance pretty useful',925641018,'.jpg','I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin','https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn','ICICI Lombard Auto Insurance',' 1/5','rmlrrturotn');">Read More</a>


How can I trigger the onclick event from Python?

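There are two approaches. One is to use a browser automation tool such as Selenium, which lets you control a browser programmatically (the most common browsers, such as Firefox and Chrome, are supported). I am not familiar with it, and in many cases it may be overkill (I assume running a browser adds some overhead), but it is good to know about.

The other approach is to look more closely at what happens when you click the "Read More" button. The Network tab of the developer tools (I use Chrome, but I believe Firefox has the same feature) shows every HTTP request the browser sends.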

I found that clicking the "Read More" button sends a POST request to https://www.mouthshut.com/review/CorporateResponse.ashx with the following form data:

type: review
reviewid: 2836986
corp: false
isvideo: false
fbmessage: I found this review of ICICI Lombard Auto Insurance pretty useful
catid: 925641018
prodimg: .jpg
twittermsg: I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin
twitterlnk: https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn
catname: ICICI Lombard Auto Insurance
rating_str:  1/5
usession: 0
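Comparing these fields with the onclick attribute in the question suggests that most of them are simply the handler's arguments in order. The sketch below makes that mapping explicit using only the standard library. The position-to-field table is inferred from this single example and is an assumption, not documented site behaviour; `parse_onclick_args` and `build_payload` are hypothetical helper names, and `type`, `isvideo` and `usession` are copied as constants from the captured request.

```python
import re

ONCLICK = (
    "bindreviewcontent('2836986',this,false,"
    "'I found this review of ICICI Lombard Auto Insurance pretty useful',"
    "925641018,'.jpg',"
    "'I found this review of ICICI Lombard Auto Insurance pretty useful %23WriteShareWin',"
    "'https://www.mouthshut.com/review/ICICI-Lombard-Auto-Insurance-review-rmlrrturotn',"
    "'ICICI Lombard Auto Insurance',' 1/5','rmlrrturotn');"
)

# Assumed mapping from argument position to form field (inferred, not documented).
FIELD_BY_POSITION = {
    0: 'reviewid', 2: 'corp', 3: 'fbmessage', 4: 'catid', 5: 'prodimg',
    6: 'twittermsg', 7: 'twitterlnk', 8: 'catname', 9: 'rating_str',
}


def parse_onclick_args(onclick):
    """Split the argument list of the onclick call into a list of strings."""
    inner = re.search(r"\((.*)\)", onclick, re.S).group(1)
    args = []
    # A token is either a single-quoted string or a bare word such as this/false/123.
    for m in re.finditer(r"'((?:[^'\\]|\\.)*)'|([^,']+)", inner):
        if m.group(1) is not None:
            args.append(m.group(1))          # quoted: keep the content verbatim
        elif m.group(2).strip():
            args.append(m.group(2).strip())  # bare: drop surrounding whitespace
    return args


def build_payload(onclick):
    """Build the POST form data from an onclick attribute value."""
    args = parse_onclick_args(onclick)
    payload = {'type': 'review', 'isvideo': 'false', 'usession': '0'}
    for position, field in FIELD_BY_POSITION.items():
        payload[field] = args[position]
    return payload


payload = build_payload(ONCLICK)
print(payload['reviewid'], payload['catid'])
```

Only `reviewid` and `catid` change from review to review in the requests shown in this thread, so extracting those two values per review may be all that is needed in practice.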

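However, when I simply sent a POST request with this data, it did not work. That usually means something in the HTTP headers matters, and most often it is the cookies; I verified that this is the case here. With the requests package (which you should be using anyway), the fix is simple: use requests.Session, which stores the cookies from one request and sends them with the next.

Here is a proof of concept: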

import requests

with requests.Session() as s:
    # Load the listing page first so the session picks up the site's cookies.
    s.get('https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018')
    # Repeat the XHR; the session sends the stored cookies along with it.
    r = s.post('https://www.mouthshut.com/review/CorporateResponse.ashx',
               data={'type': 'review', 'reviewid': '2836986', 'catid': '925641018',
                     'corp': 'false', 'catname': ''})
    print(r.text)

The result is a chunk of HTML that contains the content you are looking for. Have fun souping!
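As a rough sketch of how the proof of concept could be wrapped into a reusable function, the block below takes any session-like object, performs the GET followed by the POST, and pulls the paragraph text out of the returned HTML with the standard-library parser. `fetch_full_review` and `ParagraphText` are hypothetical names, and the assumption that the review text sits in `<p>` elements with explicit closing tags is taken from the second answer's `select('p')`, not verified against the live site.

```python
from html.parser import HTMLParser

XHR_URL = 'https://www.mouthshut.com/review/CorporateResponse.ashx'


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <p> elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1
            self.chunks.append(' ')  # keep paragraphs separated

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return ' '.join(''.join(self.chunks).split())


def fetch_full_review(session, listing_url, review_id, cat_id):
    """Return the full text of one review (hypothetical helper)."""
    # Visit the listing page first so the session stores the cookies.
    session.get(listing_url)
    response = session.post(XHR_URL, data={
        'type': 'review', 'reviewid': review_id, 'catid': cat_id,
        'corp': 'false', 'catname': ''})
    response.raise_for_status()
    parser = ParagraphText()
    parser.feed(response.text)
    return parser.text()
```

With the real library this would be called as `fetch_full_review(requests.Session(), listing_url, '2836986', '925641018')`, where `listing_url` is the product-reviews page used in the proof of concept. Passing the session in keeps the function easy to exercise with a stub, without touching the network.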

Extracting all reviews together with their ratings and links:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

BASE_URL = 'https://www.mouthshut.com/product-reviews/ICICI-Lombard-Auto-Insurance-reviews-925641018'
PAGE_URL = BASE_URL + '-page-{}'
XHR_URL = 'https://www.mouthshut.com/review/CorporateResponse.ashx'


def add_reviews(s, soup, results):
    """Append [title, link, rating, full text] for every review on one listing page."""
    for article in soup.select('.review-article'):
        info = article.select_one('a')
        title = info.text
        link = info['href']
        # The rating is the number of filled stars.
        rating = len(article.select('.rated-star'))
        # Request the full text with the same POST the "Read More" button sends.
        data = {'type': 'review',
                'reviewid': article.select_one('[reviewid]')['reviewid'],
                'catid': '925641018', 'corp': 'false', 'catname': ''}
        r = s.post(XHR_URL, data=data)
        soup2 = bs(r.content, 'lxml')
        text = ' '.join(p.text for p in soup2.select('p'))
        results.append([title, link, rating, text])


results = []

# The session keeps the cookies set by the first GET, which the POST needs.
with requests.Session() as s:
    r = s.get(BASE_URL)
    soup = bs(r.content, 'lxml')
    # The last pager link holds the page count; fall back to one page if there is no pager.
    pager = soup.select('#spnPaging .btn-link')
    pages = int(pager[-1].text) if pager else 1
    add_reviews(s, soup, results)
    for page in range(2, pages + 1):
        r = s.get(PAGE_URL.format(page))
        soup = bs(r.content, 'lxml')
        add_reviews(s, soup, results)

df = pd.DataFrame(results, columns=['Title', 'Link', 'Rating', 'Review'])
print(df)

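Sites such as Flipkart require a tool like Selenium to click the "Read More" link programmatically. The answer pointed to an example of such an implementation.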
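A minimal sketch of what a Selenium-based approach could look like follows. It is not the implementation this answer refers to. It assumes Selenium 4 with Chrome available, that the expandable element is a link whose visible text is exactly "Read More" (on other sites it may be a different element or label), and that a short fixed pause is enough for the XHR to finish. `click_all` and `page_source_with_expanded_reviews` are hypothetical names.

```python
import time


def click_all(driver, by, value, pause=1.0):
    """Click every element matching (by, value) and return how many were clicked."""
    elements = driver.find_elements(by, value)
    for element in elements:
        # A JavaScript click still works when the link is covered by an overlay.
        driver.execute_script("arguments[0].click();", element)
        # Crude wait for the XHR; an explicit wait on the expanded text is more robust.
        time.sleep(pause)
    return len(elements)


def page_source_with_expanded_reviews(url, link_text="Read More"):
    """Open url, expand every matching link, and return the resulting HTML."""
    # Imported lazily so click_all can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        click_all(driver, By.LINK_TEXT, link_text)
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be passed to BeautifulSoup as usual. If clicking one link re-renders the list, the remaining element references may go stale, in which case the elements would need to be looked up again after each click.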

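Can you show what you have tried so far?

I tried using the same code to extract data from the Flipkart site. Can you help me extract the data behind the "Read More" click on that site?

@user3415910 Where is the "Read More"?

I mean the "Read More" in the reviews.

What about the answer below?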