Python BeautifulSoup .get_text() is not specific enough for my HTML parsing
Given the HTML below, I want to output only the h1's own text, not "Details about", which is the text of the span nested inside the h1. My current output gives:
Details about New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
I want:
New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
This is the HTML I am working with:
<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>
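Note: I don't want to simply truncate the string, because I'd like this code to be reusable. Ideally, the solution would strip out any text enclosed in a span.
One solution is to check whether each child of the h1 contains HTML: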
from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child: find() returns a tag only if the child contains markup.
        if BeautifulSoup(str(content), "html.parser").find():
            continue
        print(content)
You can use extract() to remove all span tags:
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> from the h1 before reading its text.
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
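One caveat with the extract() approach is that it modifies the parsed tree, so the span is gone from soup afterwards. Below is a minimal sketch of a non-destructive variant, assuming Beautiful Soup 4.4 or later (where copy.copy() on a Tag is supported). The helper name text_without_spans and the shortened HTML are made up for illustration.

```python
import copy

from bs4 import BeautifulSoup

# Shortened stand-in for the page's HTML (illustrative only).
html = '<h1 id="itemTitle"><span class="g-hdn">Details about </span>New Wallet Black</h1>'
soup = BeautifulSoup(html, 'html.parser')


def text_without_spans(tag):
    """Return tag's text with every <span> removed, leaving the original tree intact."""
    clone = copy.copy(tag)  # detached copy of the tag and its children
    for span in clone.find_all('span'):
        span.extract()
    return clone.get_text()


title = text_without_spans(soup.find('h1'))
print(title)
# The original tree still contains the span.
print(soup.find('span') is not None)
```

If the tree is not needed afterwards, calling extract() directly on the original is simpler.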
Alternatively, skip any child that is a tag by checking its type with isinstance:
import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')
for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Children that are Tag objects (such as the span) are skipped; plain strings are printed.
        if isinstance(content, bs4.element.Tag):
            continue
        print(content)
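Since the question asks for something reusable, the type check above could be wrapped in a small helper. This is only a sketch: the function name own_text is hypothetical, and it assumes that "the h1's own text" means the strings that are direct children of the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Join only the strings that are direct children of tag, skipping nested tags."""
    return ''.join(
        child for child in tag.contents if isinstance(child, NavigableString)
    ).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')
titles = [own_text(h1) for h1 in soup.find_all('h1', attrs={'itemprop': 'name'})]
print(titles)
```

Unlike the extract() approach, this leaves the parsed tree unchanged and ignores any nested tag, not just span.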