Python BeautifulSoup: grabbing text


I'm a bit confused about how to grab content with BeautifulSoup. The HTML I'm trying to get looks like this:

<div class="txt-block"> 
    <h4 class="inline">Gross:</h4> 
    $408,992,272
</div>

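data is my BeautifulSoup object, and there are multiple instances of the h4 tag with class_="inline". I'm fine with grabbing all of the h4 tags, as long as I can also get the number that goes with each one; then I can regex out what I want.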

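If you only need the dollar amount, find all the text in the txt-block div with recursive=False, so that you don't pick up any text from its children, and strip the surrounding whitespace: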

In [26]: from bs4 import BeautifulSoup

In [27]: h = """<div class="txt-block">
                   <h4 class="inline">Gross:</h4>
                    $408,992,272
               </div>"""

In [28]: soup = BeautifulSoup(h,"lxml")

In [29]: div = soup.find("div",class_="txt-block")

In [30]: "".join(div.find_all(text=True, recursive=False)).strip()
Out[30]: '$408,992,272'
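
Since the question mentions several h4 tags with class "inline", here is a minimal sketch of how the same recursive=False idea might be applied to each of them. The second block ("Budget") and its value are made up for illustration, the helper name labelled_values is hypothetical, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the "Budget" block is invented for illustration.
html = """
<div class="txt-block">
    <h4 class="inline">Gross:</h4>
    $408,992,272
</div>
<div class="txt-block">
    <h4 class="inline">Budget:</h4>
    $220,000,000
</div>
"""


def labelled_values(markup):
    """Map each h4.inline label to the text sitting directly inside its parent."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for h4 in soup.find_all("h4", class_="inline"):
        label = h4.get_text(strip=True).rstrip(":")
        # Direct text children of the parent only, so the h4's own text is skipped.
        # string= is the newer spelling of the text= argument used above.
        value = "".join(h4.parent.find_all(string=True, recursive=False)).strip()
        result[label] = value
    return result


values = labelled_values(html)
print(values)
```

Keying the result by the h4 label makes it easy to pick out just the field you care about, for example values["Gross"].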

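Can you make sure the HTML is complete and correct? It looks like the dollar figure sits outside the h4 tag. You may need to get the parent of both the h4 tag and the dollar amount first, and work from there. What is the structure of the dollar amount's parent element?

Ah, got it: the parent element is the one containing "Gross:$408992272", and I can get it from there. Thanks, everyone!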

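Alternatively, since the amount is the last child node of the div, you can take the last item of div.contents:

In [40]: div.contents[-1].strip()
Out[40]: '$408,992,272'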
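
The question mentions running a regex over the extracted text afterwards. A minimal sketch of that last step, assuming the goal is an integer number of dollars; the helper name parse_dollars is hypothetical and only the standard library is used:

```python
import re


def parse_dollars(text):
    """Return the first dollar amount found in text as an int, or None."""
    # A dollar sign followed by a digit, then any run of digits and commas.
    match = re.search(r"\$(\d[\d,]*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


amount = parse_dollars("Gross: $408,992,272")
print(amount)  # 408992272
```

This works on either the stripped string from the div or the full text of the block, since the pattern only looks for the first dollar-prefixed number.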