Python 3.x: How do I scrape reviews when not all of the data I need is in text form?


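I'm trying to scrape reviews for university research. My code prints most of the information I need, but I also need to find the rating and the user ID.

Here is some of my code: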

import requests
from bs4 import BeautifulSoup


s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
           'Referer': "http://www.imdb.com/"}


url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

s.post(url, headers=headers)

for i in soup('style'):
    i.decompose()
for s in soup('script'):
    s.decompose()
for t in soup('table'):
    t.decompose()
for ip in soup('input'):
    ip.decompose()

important = soup.find("div", id='tn15content')

print(important.text)
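This returns most of the information I need, in a printout like this:

Output (only this one review is shown; the code prints every review on the page)

However, I also need the user ID and the rating given for each movie.

The user ID is contained in each a href element, like this: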

<a href="/user/ur0511587/">

The rating is contained in each img element, as shown below, where the rating is the "10/10" in the alt attribute:

<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif">

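Are there any tips on how to scrape these two items in addition to the output, which is easy to get by printing "important.text" rather than just "important"? I'm hesitant to print just "important" because it would clutter the output with all the tags and other unnecessary markup. Thanks for any input.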

You can use CSS selectors: a[href^="/user/ur"] finds every anchor whose href starts with /user/ur, and img[alt*="/10"] finds every img tag whose alt attribute has a value of the form "some_number/10".
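
As a minimal sketch of what those two selectors match, the snippet below runs them against a small hand-written HTML sample. The markup is hypothetical and only modelled on the fragments quoted in the question, and it uses the built-in "html.parser" so it runs without lxml. The attribute values are quoted because recent Beautiful Soup releases, which delegate select() to Soup Sieve, generally reject unquoted values that contain a slash.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the snippets in the question.
html = """
<div id="tn15content">
  <div>
    <a href="/user/ur0511587/">first reviewer</a>
    <img width="102" height="12" alt="10/10" src="100.gif">
  </div>
  <div>
    <a href="/user/ur1318093/">second reviewer</a>
  </div>
  <a href="/title/tt0082158/">not a user link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
important = soup.find("div", id="tn15content")

# ^= means "attribute value starts with": anchors that point at user profiles.
user_links = important.select('a[href^="/user/ur"]')
# *= means "attribute value contains": images whose alt text holds a rating.
rating_imgs = important.select('img[alt*="/10"]')

user_hrefs = [a["href"] for a in user_links]
ratings = [img["alt"] for img in rating_imgs]

print(user_hrefs)  # ['/user/ur0511587/', '/user/ur1318093/']
print(ratings)     # ['10/10']
```

Note that the two lists have different lengths here: the second reviewer gave no rating, so selecting across the whole container does not tell you which rating belongs to which user.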

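The problem now is that not every review has a rating, and simply finding every a[href^="/user/ur"] would give us more than we want. To get around this, we can find the specific div that contains the anchor and the rating (if present) by finding the small tag that contains the text "review useful:" and then calling .parent to select the div: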

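import re

important = soup.find("div", id='tn15content')

# Each review header div holds a <small> tag containing "review useful:".
for small in important.find_all("small", text=re.compile("review useful:")):
    div = small.parent
    # Attribute values are quoted so the selectors also parse under newer Beautiful Soup releases.
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')  # None when the review has no rating
    print(user_id, rating["alt"] if rating else "N/A")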
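Now we get (the tuple-style formatting comes from running print under Python 2; Python 3 prints the two values separated by a space):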

('0511587', '10/10')
('0209436', '9/10')
('1318093', 'N/A')
('0556711', '10/10')
('0075285', '9/10')
('0059151', '10/10')
('4445210', '9/10')
('0813687', 'N/A')
('0033913', '10/10')
('0819028', 'N/A')
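
If the ratings are going to be analysed rather than just printed, it may help to turn strings such as '9/10' into numbers. The helper below is a hypothetical addition, not part of the original answer; it assumes every rating is either "number/number" or the "N/A" placeholder used above, and the sample rows are copied from the first three lines of the output.

```python
def rating_to_score(rating):
    """Return the numeric part of a rating such as '9/10', or None for 'N/A'."""
    head, sep, tail = rating.partition("/")
    if sep and head.isdigit() and tail.isdigit():
        return int(head)
    return None


# First three (user_id, rating) pairs from the output above.
rows = [("0511587", "10/10"), ("0209436", "9/10"), ("1318093", "N/A")]

scores = [rating_to_score(rating) for _, rating in rows]
rated = [score for score in scores if score is not None]

print(scores)                   # [10, 9, None]
print(sum(rated) / len(rated))  # 9.5
```

Keeping unrated reviews as None, rather than 0, avoids dragging the average down when a reviewer simply did not give a score.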
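You were also doing more work than necessary to get the page source; a single get request is all you need. The complete code is as follows: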

import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'

# A single GET request, with the headers actually passed, is enough.
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")

important = soup.find("div", id='tn15content')

# Each review header div holds a <small> tag containing "review useful:".
for small in important.find_all("small", text=re.compile("review useful:")):
    div = small.parent
    # Attribute values are quoted so the selectors also parse under newer Beautiful Soup releases.
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')  # None when the review has no rating
    print(user_id, rating["alt"] if rating else "N/A")
To get the review text, just find the next p after the div:

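for small in important.find_all("small", text=re.compile("review useful:")):
    div = small.parent
    # Attribute values are quoted so the selectors also parse under newer Beautiful Soup releases.
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')  # None when the review has no rating
    print(user_id, rating["alt"] if rating else "N/A")
    # The review body is the first <p> that follows the header div.
    print(div.find_next("p").text.strip())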
This gives you output like the following:

('0511587', '10/10')
I happened to be flipping channels today and saw this was on.  Since it
had
been several years since I last saw it I clicked it on, but didn't mean to
stay.  As it happened, I found this film to be just as gripping now as it
was before.  My own kids started watching it, too, and enjoyed it - which
was even more satisfying for me considering the kind of current junk
they're
used to.  No, this is not an action-packed thriller, nor are there juicy
love scenes between Abrahams and his actress girlfriend.  There is no
"colorful" language to speak of; no politically correct agenda underlying
its tale of a Cambridge Jew and Scottish Christian.This is a story about what drives people internally - what pushes them to
excel or at least to make the attempt to do so.  It is a story about
personal and societal values, loyalty, faith, desire to be accepted in
society and healthy competition without the utter selfishness that
characterizes so much of the athletic endeavors of our day.  Certainly the
characters are not alike in their motivation, but the end result is the
same
as far as their accomplishments.My early adolescent son (whose favorite movies are all of the Star Wars
movies and The Matrix) couldn't stop asking questions throughout the movie
he was so hooked.  It was a great educational opportunity as well as
entertainment.  If you've never seen this film or it's been a long time, I
recommend it unabashedly, regardless of the labels many have tried to give
it for being slow-paced or causing boredom.  In addition to the great
story
- based on real people and events - the photography and the music are
fabulous and moving.  It's no mistake that this movie has been spoofed and
otherwise stolen from in the last twenty years - it's an unforgettable
movie
and in my opinion its bashers are those who hate Oscar winners on
principle
or who don't like the philosophies espoused by its protagonists.
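
To keep each user ID, rating, and review text together instead of printing them one after another, the same loop can collect one record per review. The sketch below is a hedged illustration: extract_reviews is a hypothetical helper name, the HTML is a hand-written stand-in for the page layout described above (so it runs without a network request), and it uses string=, the current name of the text= argument, together with the built-in "html.parser".

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the review page markup.
html = """
<div id="tn15content">
  <div>
    <small>120 out of 150 people found the following review useful:</small>
    <a href="/user/ur0511587/">first reviewer</a>
    <img width="102" height="12" alt="10/10" src="100.gif">
  </div>
  <p>Review text one.</p>
  <div>
    <small>3 out of 4 people found the following review useful:</small>
    <a href="/user/ur1318093/">second reviewer</a>
  </div>
  <p>Review text two.</p>
</div>
"""


def extract_reviews(markup):
    """Return one dict per review with its user id, rating, and text."""
    soup = BeautifulSoup(markup, "html.parser")
    important = soup.find("div", id="tn15content")
    records = []
    for small in important.find_all("small", string=re.compile("review useful:")):
        div = small.parent
        link = div.select_one('a[href^="/user/ur"]')
        rating = div.select_one('img[alt*="/10"]')
        body = div.find_next("p")  # first <p> after the header div
        records.append({
            "user_id": link["href"].split("ur")[1].rstrip("/") if link else None,
            "rating": rating["alt"] if rating else "N/A",
            "review": body.get_text(strip=True) if body else "",
        })
    return records


reviews = extract_reviews(html)
for review in reviews:
    print(review["user_id"], review["rating"], review["review"])
```

A list of dicts like this can be handed straight to csv.DictWriter or a DataFrame constructor if the reviews need to be stored for later analysis.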

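Padraic, great, thank you. This helps a lot. Just wondering for the future: is it possible to do this so that I can print the rating and userId next to the review they belong to, along with the information I was printing before?

@user6326823, no worries, see the edit regarding the associated review text.

No worries. Normally we would use ids, class names, and so on, but on this particular site there is nothing useful or reliable for what we need.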