Scraping text data with Python (python, beautifulsoup)


I'm new to Python, and I'm trying to scrape some text reviews from a website using BeautifulSoup. Part of the HTML structure looks like this:

<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
<div>

However, what I get back is everything from text 1 through text 5. Is there a simple way to target the 1st-level div and get only text 5?

I'm not sure these approaches are the best, but give them a try:

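from collections import deque

from bs4 import BeautifulSoup

# Sample markup from the question. The final <div> is left unclosed, as posted.
html = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
<div>"""

# Name the parser explicitly so the result does not depend on what is installed.
web_soup = BeautifulSoup(html, "html.parser")
reviews = web_soup.find("div", style="1st level")

# Option 1: index into the direct children. The trailing unclosed <div> is
# parsed as an empty last child, so the text node sits at [-2]; with a
# proper closing </div> it would be at [-1].
print(reviews.contents[-2])
# Option 2: keep only the last string yielded by the .strings generator.
print(deque(reviews.strings, maxlen=1).pop())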

FYI, I used deque to grab the last element from the strings generator.
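To illustrate the deque trick on its own, here is a minimal sketch using only the standard library. The helper name last_item and the sample generator are hypothetical stand-ins (the generator plays the role of reviews.strings); they are not part of the original answer.

```python
from collections import deque


def last_item(iterable, default=None):
    """Return the final item of any iterable, or `default` if it is empty."""
    # maxlen=1 makes the deque discard older items as new ones arrive,
    # so only the most recent item is ever kept in memory.
    tail = deque(iterable, maxlen=1)
    return tail.pop() if tail else default


# A generator standing in for reviews.strings (hypothetical sample data).
sample_strings = (f"Here is text {n}" for n in range(1, 6))
print(last_item(sample_strings))  # Here is text 5
```

This works for any generator, which cannot be indexed with [-1] directly, and it avoids building a full list just to read the last entry.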

Also, FYI: lxml + XPath can do this job more easily by using text().
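The answer does not show the lxml version, so the following is only a sketch of what it might look like. It assumes lxml is installed and that the last tag in the posted markup was meant to be a closing </div>; the names PAGE and direct_text are made up for this example.

```python
from lxml import html as lxml_html  # third-party: pip install lxml

# Sample markup from the question, assuming the last tag is a closing </div>.
PAGE = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
</div>"""


def direct_text(markup, style):
    """Return the non-blank text nodes that are direct children of the div."""
    tree = lxml_html.fromstring(markup)
    # text() selects only text nodes directly under the matched div,
    # not the text inside its child elements.
    nodes = tree.xpath("//div[@style=$style]/text()", style=style)
    return [node.strip() for node in nodes if node.strip()]


print(direct_text(PAGE, "1st level"))
```

The whitespace between the nested divs also comes back as text nodes, which is why the blank entries are filtered out before returning.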

Hope this helps.
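If relying on a fixed index into contents feels fragile, one possible alternative (not from the original answer) is to ask BeautifulSoup for only the text nodes that are direct children of the outer div. This sketch assumes bs4 4.4 or later for the string= argument and a properly closed </div>; PAGE and own_text are hypothetical names.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Sample markup from the question, assuming the last tag is a closing </div>.
PAGE = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
</div>"""


def own_text(markup, style):
    """Return the non-blank strings that sit directly inside the matched div."""
    soup = BeautifulSoup(markup, "html.parser")
    outer = soup.find("div", style=style)
    # string=True matches text nodes; recursive=False limits the search to
    # direct children, so text inside the nested divs is skipped.
    pieces = outer.find_all(string=True, recursive=False)
    return [piece.strip() for piece in pieces if piece.strip()]


print(own_text(PAGE, "1st level"))
```

Because it filters by node type rather than by position, this should behave the same whether or not the page has extra whitespace or trailing elements after the wanted text.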

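We will almost certainly need the HTML source of the site in question to help you with this. I went to fix the formatting of your question and found that there was some HTML source in it (though still not enough). Please edit your post to include the actual HTML source, and make sure it is formatted correctly. Are you sure those style= attributes shouldn't be class=?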


Each of the two print calls in the answer outputs this line (surrounding whitespace aside):

Here is text 5 and this is the part I want to get.