Python: BeautifulSoup's get_text() is not specific enough for my HTML parsing


python html regex


Given the HTML below, I want to output only the h1's own text, without "Details about", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

This is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Note: I don't want to simply truncate the string, because I want this code to be reusable.

Ideally, the code would strip out any text enclosed in a span.

One solution is to check whether each child of the h1 contains HTML:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse each child; find() returns a tag only if the child is markup
        if BeautifulSoup(str(content), "html.parser").find():
            continue

        # Only plain-text children reach this point
        print(content)

You can remove all span tags with extract():

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every span from the h1 (this modifies the parsed tree)
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
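Because extract() changes the parsed tree, a variant that leaves the original soup untouched may be handy when the same document is queried again later. The following is a minimal sketch under that assumption; the helper name text_without_spans is hypothetical, and it relies on copy.copy() producing a detached copy of a Tag, which Beautiful Soup supports from version 4.4.0 onward.

```python
import copy

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")


def text_without_spans(tag):
    """Return the text of `tag` with every nested span removed.

    Hypothetical helper: works on a copy so the original tree is unchanged.
    """
    clone = copy.copy(tag)  # detached copy of the tag and its children
    for span in clone.find_all("span"):
        span.decompose()  # remove the span and its contents from the copy
    return clone.get_text(strip=True)


title = text_without_spans(soup.find("h1", attrs={"itemprop": "name"}))
print(title)
```

The original soup still contains the span afterwards, so later lookups against it see the unmodified markup.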

Alternatively, skip the children that are Tag instances:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Nested tags (such as the span) are Tag objects; plain text is not
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
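For the reusability the question asks about, the same idea can be wrapped in a small function. This is only a sketch: the name own_text is hypothetical, and it uses find_all(string=True, recursive=False) to collect the text nodes that are direct children of a tag, so text inside any nested element (not just span) is skipped. Note that string=True also matches other string-like nodes such as comments.

```python
from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Return only the text that sits directly inside `tag`.

    Hypothetical helper: text belonging to nested tags is ignored.
    """
    # recursive=False limits the search to direct children;
    # string=True keeps the string nodes and drops the tags.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


titles = [own_text(h1) for h1 in soup.find_all("h1", attrs={"itemprop": "name"})]
print(titles)
```

Unlike the extract() approach, this does not modify the parsed tree.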
