Python 3.x: Keep one nested div in an HTML document and clear all other divs
Tags: python-3.x, beautifulsoup, lxml

I'm trying to strip the clutter out of a noisy, deeply nested HTML document. I want to keep the structure of the page and only clear the contents of the surrounding divs. The structure looks like this:
<div class="a">
    ...stuff...
    <div>
        ...stuff....
        <div class="my_class_of_interest">
            ....several levels deeper...
        </div>
        ..stuff..
    </div>
    ...stuff..
</div>
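But my attempt also wipes out the div I'm interested in, I suspect because I'm clearing its parent and the clearing carries on all the way down. Is there a way to clear a div's text without deleting the divs nested inside it? Or is there a better approach?

I hope I've understood your question correctly. This script removes all strings surrounding the tag of interest: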
from bs4 import BeautifulSoup, Tag

txt = '''
<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several levels deeper...
</div>
..stuff..
</div>
...stuff..
</div>'''

soup = BeautifulSoup(txt, 'html.parser')

# print soup before clearing
print(soup)


def clear(tag):
    # Iterate over a copy, because replace_with() modifies tag.contents.
    for c in list(tag.contents):
        if isinstance(c, Tag) and c.name == 'div' and 'my_class_of_interest' in c.get('class', []):
            # Leave the div of interest (and everything inside it) untouched.
            continue
        elif isinstance(c, Tag):
            # Any other tag: keep the tag itself, but clear its contents recursively.
            clear(c)
        else:
            # A text node outside the div of interest: blank it out.
            c.replace_with('')


clear(soup.select_one('div.a'))

print('-' * 80)

# print soup after clearing:
print(soup.prettify())
Prints:
<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several levels deeper...
</div>
..stuff..
</div>
...stuff..
</div>
--------------------------------------------------------------------------------
<div class="a">
 <div>
  <div class="my_class_of_interest">
   ....several levels deeper...
  </div>
 </div>
</div>
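A non-recursive variation on the same idea, as a rough sketch: instead of walking the tree by hand, visit every text node and drop the ones that have no ancestor carrying the target class. The helper name `strip_text_outside` and the extra `<b>` tag in the sample are my own assumptions for illustration and are not part of the question; the sketch also assumes a Beautiful Soup version that accepts `find_all(string=True)`.

```python
from bs4 import BeautifulSoup

# Sample document; the <b> tag is added here only to show that markup
# nested inside the div of interest survives.
HTML = """
<div class="a">
  ...stuff...
  <div>
    ...stuff....
    <div class="my_class_of_interest">
      ....several <b>levels</b> deeper...
    </div>
    ..stuff..
  </div>
  ...stuff..
</div>"""

TARGET_CLASS = "my_class_of_interest"


def strip_text_outside(soup, target_class):
    """Remove every text node that is not inside a tag carrying target_class.

    Hypothetical helper: tags are left in place, only strings are removed.
    """
    # Copy the result into a list so the tree is not modified while it is walked.
    for node in list(soup.find_all(string=True)):
        inside_target = any(
            target_class in parent.get("class", []) for parent in node.parents
        )
        if not inside_target:
            node.extract()
    return soup


soup = strip_text_outside(BeautifulSoup(HTML, "html.parser"), TARGET_CLASS)
print(soup.prettify())
```

Because the check looks at all ancestors, it should also cope with several divs of interest at different depths, at the cost of walking up the tree once per text node.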
Another option, using lxml:
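import lxml.html as lh

interest = """your html above"""
doc = lh.fromstring(interest)

retain = ''
for d in doc.xpath('//*'):
    # Save the direct text of the div of interest before it is blanked below.
    # (Checks the class attribute explicitly rather than the first attribute value.)
    if d.get('class') == 'my_class_of_interest':
        retain += d.text or ''
    # Blank the text and tail of every element, the target included.
    d.text = ""
    d.tail = ""

# Put the saved text back into the div of interest.
for target in doc.xpath('//div[@class="my_class_of_interest"]'):
    target.text = retain

print(lh.tostring(doc).decode())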
Output:
<div class="a"><div><div class="my_class_of_interest">
....several levels deeper...
</div></div></div>
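Note that the lxml snippet above only restores the target div's own leading text, so text held by elements nested inside it would be lost. A possible way around that, sketched under my own assumptions: treat `text` as belonging to the element and `tail` as belonging to its parent, and clear each only when its owner is outside the div of interest. The helper `in_target`, the constant `KEEP`, and the `<b>` tag in the sample are hypothetical additions, and the XPath matches the class attribute exactly, as the original snippet does.

```python
import lxml.html as lh

# Sample document; the <b> tag is added only to show nested markup surviving.
HTML = """<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several <b>levels</b> deeper...
</div>
..stuff..
</div>
...stuff..
</div>"""

# Non-empty when the context element is a target div or sits inside one.
KEEP = 'ancestor-or-self::div[@class="my_class_of_interest"]'


def in_target(el):
    """Hypothetical helper: True if el is, or lies within, a div of interest."""
    return el is not None and bool(el.xpath(KEEP))


doc = lh.fromstring(HTML)

for el in doc.iter():
    if not isinstance(el.tag, str):
        continue  # skip comments and processing instructions
    if not in_target(el):
        el.text = None  # text directly inside an element we do not keep
    if not in_target(el.getparent()):
        el.tail = None  # text after el, which belongs to el's parent

result = lh.tostring(doc).decode()
print(result)
```

Only `text` and `tail` are changed, so the element tree itself keeps its shape.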
Comments:

soup.find_all('div', class_='my_class_of_interest')

I want to keep the structure of the page and just clear the contents of the surrounding divs.

The structure of the page will be preserved, because the only div you select is the one you're interested in.

Just to confirm: you're basically interested in removing all text in the document except the text inside the target div?

Yes. I found that simply extracting the div seemed to mess up the page, so I want to keep the structure and remove everything else to see whether that helps.