Python 3.x: How do I scrape reviews from the website "flipkart.com"?


I don't know which class to select; I tried different selector classes, but an empty list is returned.

I tried the following code:

import requests as req
from bs4 import BeautifulSoup as bs

url = 'https://www.flipkart.com/nokia-6-1-plus-black-64-gb/product-reviews/itmf8r36g9gfpafg?pid=MOBF8FCFB9KWUTVQ'

page = req.get(url)
rev = soup.find_all(class_ = "_2xg6Ul")

I want to export the reviews and store them in a text file for later use.

You never define soup in the code you say you tried.

But you don't need Selenium, because the reviews are embedded in the <script> tags. The only caveat is that you have to iterate through every page to get all of the reviews, but you would need to do that with Selenium anyway (since there are only 10 reviews per page... in this case, 1976 pages). This will get you the reviews:

Note: I only did 5 pages. If you want to do all 1,900+ pages, you'll need to comment out the line I hard-coded.

import requests as req
from bs4 import BeautifulSoup as bs
import json
import math

# Get the total number of pages from the JSON state embedded in the page
url = 'https://www.flipkart.com/nokia-6-1-plus-black-64-gb/product-reviews/itmf8r36g9gfpafg?pid=MOBF8FCFB9KWUTVQ'

page = req.get(url)
soup = bs(page.text, 'html.parser')

scripts = soup.find_all('script')
for script in scripts:
    if 'window.__INITIAL_STATE__ = ' in script.text:
        # The review data is embedded as JSON inside this <script> tag
        script_str = script.text
        jsonStr = script_str.split('window.__INITIAL_STATE__ = ')[1]
        jsonStr = jsonStr.rsplit(';', 1)[0]

        jsonObj = json.loads(jsonStr)
        total_pages = math.ceil(jsonObj['ratingsAndReviews']['reviewsData']['totalCount'] / 10)


total_pages = 5  # <------ remove this to get all pages, or set your own page limit

for page in range(1, total_pages + 1):
    page_url = url + '&page=%s' % page

    print ('Page %s' % page)
    resp = req.get(page_url)
    soup = bs(resp.text, 'html.parser')

    scripts = soup.find_all('script')
    for script in scripts:
        if 'window.__INITIAL_STATE__ = ' in script.text:
            script_str = script.text
            jsonStr = script_str.split('window.__INITIAL_STATE__ = ')[1]
            jsonStr = jsonStr.rsplit(';', 1)[0]

            jsonObj = json.loads(jsonStr)

    # Print the text of each review on the current page
    for each in jsonObj['ratingsAndReviews']['reviewsData']['reviewsData']['nonAspectReview']:
        print (each['value']['text'], '\n')
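
Since the question also asks to export the reviews to a text file for later use, here is a minimal sketch of how that could be bolted onto the code above: collect the review texts in a list inside the page loop, then write them out once at the end. The file name reviews.txt is just an example, and the JSON keys are assumed to be the same ones used in the answer's code.

all_reviews = []   # collect review texts across all pages

# ... inside the page loop, instead of printing:
for each in jsonObj['ratingsAndReviews']['reviewsData']['reviewsData']['nonAspectReview']:
    all_reviews.append(each['value']['text'])

# after the loop, dump everything to a text file
with open('reviews.txt', 'w', encoding='utf-8') as f:
    for review in all_reviews:
        f.write(review + '\n')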
This page loads its data with JavaScript, but BS doesn't run JavaScript. You can use Python + Selenium to control a web browser, which will load the page and run the JavaScript. Sorry, I defined soup in my original code but forgot to define it here. I kind of figured, but just wanted to mention it since it wasn't in the post. Did this solution work for you?
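
For reference, a minimal sketch of the Selenium approach mentioned in the comment above might look like the following. It assumes a Chrome driver is available on the system, and you would still need to find the right element classes (or parse the embedded JSON) yourself from the rendered HTML.

from selenium import webdriver
from bs4 import BeautifulSoup as bs

# Minimal Selenium sketch (assumes chromedriver is installed and on PATH)
url = 'https://www.flipkart.com/nokia-6-1-plus-black-64-gb/product-reviews/itmf8r36g9gfpafg?pid=MOBF8FCFB9KWUTVQ'

driver = webdriver.Chrome()
driver.get(url)                                 # loads the page and runs its JavaScript

soup = bs(driver.page_source, 'html.parser')    # parse the fully rendered HTML
driver.quit()

The accepted answer avoids this extra dependency, though, since the review data is already present in the page's script tags.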