Python 3.x: Can I use alt=[Next] to collect reviews from every page?

Tags: python-3.x, web-scraping, beautifulsoup, python-requests

Once again, I'm asking for help with a bit of university research. I'm trying to find a way to grab all of the reviews for each movie without having to manually write out every URL and iterate over them in a set.

So I'm trying to find the "Next" button and use it to decide how many pages of reviews to collect. In theory, the scraper should stop at the last page of reviews, since there is no "Next" button on the last page. So if there are three pages of reviews, it would stop after getting the reviews on the third page.

To keep things simple, this is just some of the code I have right now, and it only gets the first page of reviews:

import re

import requests
from bs4 import BeautifulSoup

s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
           'Referer': "http://www.imdb.com/"}

count = 0
url = 'http://www.imdb.com/title/tt0182408/reviews?start=' + str(count)
page = s.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

# These lines had no effect on the scrape, so they are left commented out:
# cj = s.cookies
# requests.utils.dict_from_cookiejar(cj)
# nv = soup.find("input", value="nv_sr_fn")["value"]
# hidden_data = dict(ref_=nv)
# s.post(url, data=hidden_data, headers=headers)

important = soup.find("div", id='tn15content')

# Drop the paragraph tags nested inside inner divs so they are not
# mistaken for review bodies below.
for div in important.findAll("div"):
    for p in div.findAll("p"):
        p.decompose()

for small in important.findAll("small", text=re.compile("review useful:")):
    div = small.parent
    # The href looks like /user/ur0186755/ -> user id "ur0186755".
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("r/")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')
    print(user_id, rating["alt"] if rating else "N/A")
    print(div.findAll("small"))
    print(div.find_next("h2").text.strip())
    print(div.find_next("a").text.strip())
    print(div.find_next("p").text.strip())

# This finds the [Next] button, but count is never used to request another
# page, which is why only the first page of reviews is ever collected.
for td in important.findAll('td'):
    for a in td.findAll('a'):
        for img in a.findAll('img', alt=True):
            if img['alt'] == "[Next]":
                count += 10
            else:
                break
This is the last review it gets, from the first page:

ur0186755 1/10
[<small>11 out of 20 people found the following review useful:</small>, <small>from South Texas</small>, <small>27 March 1999</small>]
One of the stupidest films ever made...

Before I start to tear apart this movie, mark you--I LOVE THE SCARLET PIMPERNEL. That story is one of the best romantic adventures ever written. The movie staring Jane Grey is very good and the musical on Broadway is the hottest thing there. So, I thought when I heard that this film was coming out that it would be great since it was a BBC film. To my surprise, it was a weak, totally stupid story that UTTERLY failed in capturing the gorgeous tale. There were no exciting escapes with daring disguises. There was no deep love that made your heart flutter as Percy left the room and Marguerite sighed as her husband was leaving her again. All it had was a confusing plot and a lot of out-of-the-blue sex and violence. Sink me! What a horrible movie!

Any tips on how to collect the reviews from every page, other than manually putting the URLs in a set and iterating over them? Or is that what I'll have to do? Many thanks.

First, make sure you are not breaking any rules and are staying within the law. You are better off using the IMDb API than web scraping.

To answer your question, I would use an endless loop with a break condition based on the presence of the "Next" link; the code is in the first block further down.

It prints 30 rows of review titles, 10 per page.

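As a hedged illustration of the API route this answer recommends (not part of the original answer): a minimal sketch using the third-party IMDbPY package. The 'reviews' info set and the review dictionary keys are assumptions to check against the package documentation.

from imdb import IMDb  # third-party package: pip install imdbpy

ia = IMDb()
movie = ia.get_movie('0182408')      # IMDb title id without the "tt" prefix
ia.update(movie, info=['reviews'])   # assumed info set that fetches user reviews
for review in movie.get('reviews', []):
    # 'rating' and 'content' are assumed keys on each review dictionary
    print(review.get('rating'), (review.get('content') or '')[:80])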

You can keep going until the img with the [Next] alt is no longer on the page, and you can get the next page's href by calling .parent on the img tag; the code is in the second block further down.

You may also want to consider adding a sleep between each request, or, better, use the IMDb API. From the IMDb conditions of use, under Robots and Screen Scraping:


"You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."

Comment from the asker: Is there any way I could do this for multiple movies? This works great for one movie, but as soon as I try to put multiple movies in a set, it doesn't work. (A sketch addressing this follows the first code block below.)
The first answer's code:

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
        'Referer': "http://www.imdb.com/"
    }

    page = 0
    while True:
        url = 'http://www.imdb.com/title/tt0182408/reviews?start=' + str(page)
        response = session.get(url)
        soup = BeautifulSoup(response.content, "lxml")

        # Each review title on the page is an h2 inside the tn15content div.
        important = soup.find("div", id='tn15content')
        for title in important.find_all("h2"):
            print(title.get_text())

        # Break if no [Next] button is present, i.e. this was the last page.
        if not soup.find("img", alt="[Next]"):
            break

        page += 10
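To address the comment about scraping several movies, here is a minimal sketch (not from the original answers) that wraps the same loop in a function and iterates over a list of title ids, with a polite delay between requests as suggested above. The half-second delay is an arbitrary assumption, and tt0082158 is simply the other title id that appears later in this thread.

import time

import requests
from bs4 import BeautifulSoup


def review_titles(session, title_id, delay=0.5):
    # Yield review titles for one movie, following the ?start= paging
    # until the [Next] image disappears from the page.
    start = 0
    while True:
        time.sleep(delay)  # assumed polite delay between requests
        url = 'http://www.imdb.com/title/%s/reviews?start=%d' % (title_id, start)
        soup = BeautifulSoup(session.get(url).content, "lxml")
        important = soup.find("div", id='tn15content')
        for title in important.find_all("h2"):
            yield title.get_text()
        if not soup.find("img", alt="[Next]"):
            break
        start += 10


with requests.Session() as session:
    # tt0182408 is from the question; tt0082158 is a second placeholder title.
    for title_id in ['tt0182408', 'tt0082158']:
        for review_title in review_titles(session, title_id):
            print(title_id, review_title)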
The second answer's code:

import re

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3 (this was urlparse in Python 2)


def parse(soup):
    important = soup.find("div", id='tn15content')
    for small in important.find_all("small", text=re.compile("review useful:")):
        div = small.parent
        # The href looks like /user/ur0186755/ -> user id "ur0186755".
        user_id = div.select_one('a[href^="/user/ur"]')["href"].split("/user/")[1].rstrip("/")
        rating = div.select_one('img[alt*="/10"]')
        yield user_id, rating["alt"] if rating else "N/A"


def get_all_pages(start):
    base = "http://www.imdb.com/title/tt0082158/"
    soup = BeautifulSoup(requests.get(start).content, "lxml")
    for tup in parse(soup):
        yield tup

    # iter(callable, sentinel): keep fetching pages until no [Next] image is found.
    for nxt in iter(lambda: soup.find("img", alt="[Next]"), None):
        soup = BeautifulSoup(requests.get(urljoin(base, nxt.parent["href"])).content, "lxml")
        for tup in parse(soup):
            yield tup


start = "http://www.imdb.com/title/tt0082158/reviews"
for uid, rat in get_all_pages(start):
    print(uid, rat)
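One design note on the block above: the two-argument form iter(callable, sentinel) calls the function repeatedly and stops as soon as it returns the sentinel, so the page loop ends exactly when soup.find no longer sees a [Next] image. A self-contained toy demonstration of the pattern:

# Toy demonstration of iter(callable, sentinel), unrelated to IMDb itself.
pages = ["page-2", "page-3", None]  # None plays the role of the missing [Next]

def find_next():
    return pages.pop(0)

for href in iter(find_next, None):
    print("following", href)  # prints page-2 then page-3, then the loop stops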