Scraping text data with Python
I'm new to Python, and I'm trying to use BeautifulSoup to scrape some text reviews from a website. Part of the HTML structure looks like this:
<div style="1st level">
<div style="2nd level">Here is text 1</div>
<div style="2nd level">Here is text 2</div>
<div style="2nd level">Here is text 3</div>
<div style="2nd level">Here is text 4</div>
Here is text 5 and this is the part I want to get.
<div>
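But when I tried, I got everything from text 1 through text 5. Is there a simple way to target the 1st level and get only text 5?

I'm not sure these are the best approaches, but give them a try: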
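from bs4 import BeautifulSoup as soup
from collections import deque

# Sample markup from the question; the trailing <div> is left unclosed, as posted
html = """<div style="1st level">
<div style="2nd level">Here is text 1</div>
<div style="2nd level">Here is text 2</div>
<div style="2nd level">Here is text 3</div>
<div style="2nd level">Here is text 4</div>
Here is text 5 and this is the part I want to get.
<div>"""
web_soup = soup(html, "html.parser")  # name the parser explicitly to avoid a warning
reviews = web_soup.find('div', style="1st level")
# Approach 1: the second-to-last child is the text node before the trailing <div>
print(reviews.contents[-2])
# Approach 2: keep only the last string yielded by the .strings generator
print(deque(reviews.strings, maxlen=1).pop())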
FYI, I used deque to get the last element from the strings generator.
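Both approaches above rely on text 5 being the last piece of text in the outer div. If that position can't be counted on, one alternative is to ask BeautifulSoup only for the text nodes that are direct children of the 1st-level div. This is a minimal sketch, assuming the markup from the question and the built-in "html.parser"; the names outer, direct_strings and own_text are mine, not from the original post.

```python
from bs4 import BeautifulSoup

html = """<div style="1st level">
<div style="2nd level">Here is text 1</div>
<div style="2nd level">Here is text 2</div>
<div style="2nd level">Here is text 3</div>
<div style="2nd level">Here is text 4</div>
Here is text 5 and this is the part I want to get.
<div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", style="1st level")

# string=True matches text nodes only; recursive=False restricts the
# search to direct children, so text inside the nested divs is skipped.
direct_strings = outer.find_all(string=True, recursive=False)

# Drop the whitespace-only nodes left over from line breaks in the markup.
own_text = [s.strip() for s in direct_strings if s.strip()]
print(own_text)
```

This should keep working even if more nested divs are added after text 5, since it doesn't depend on the text's position among the children.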
Also, FYI, lxml plus XPath can do this more easily by using text().
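For what it's worth, here is roughly what the lxml route could look like. It's only a sketch, assuming lxml is installed and the same markup as in the question; the variable names (markup, root, text_nodes, own_text) are my own.

```python
from lxml import html as lxml_html

markup = """<div style="1st level">
<div style="2nd level">Here is text 1</div>
<div style="2nd level">Here is text 2</div>
<div style="2nd level">Here is text 3</div>
<div style="2nd level">Here is text 4</div>
Here is text 5 and this is the part I want to get.
<div>"""

root = lxml_html.fromstring(markup)

# text() selects only the text nodes that are direct children of the
# matched div, not the text inside its nested divs.
text_nodes = root.xpath('//div[@style="1st level"]/text()')

# Discard whitespace-only nodes that come from line breaks in the markup.
own_text = [t.strip() for t in text_nodes if t.strip()]
print(own_text)
```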
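Hope this helps.

We almost certainly need the HTML source of the site in question to help you with this. I went to fix your question's formatting and found some HTML source in it (though not enough). Please edit your post to insert the actual HTML source, and make sure it's formatted properly.

Are you sure those style= shouldn't be class=?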
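If the attribute does turn out to be class rather than style, the lookup changes slightly, because class is a reserved word in Python and BeautifulSoup takes it as class_ instead. A small sketch with made-up markup (the class names "review" and "reply" are hypothetical, not taken from the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are invented for illustration.
html = """<div class="review">
<div class="reply">Here is text 1</div>
Here is text 5
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form: note the trailing underscore in class_.
by_class = soup.find("div", class_="review")

# Equivalent dictionary form, which works for any attribute name.
by_attrs = soup.find("div", attrs={"class": "review"})

print(by_class is by_attrs)
```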
Each of the two print calls in the answer's code outputs (surrounding blank lines aside):
Here is text 5 and this is the part I want to get.