Python: Combining multiple tags with lxml
Tags: python, html, xpath, lxml

I have an HTML file that looks like this:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
With some code I was able to strip the strong tags from the part I want, which gives:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tags back...

Answer: I can get this done with bs4 (BeautifulSoup), which prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
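As a rough illustration of the BeautifulSoup route, here is a minimal sketch that does the whole join in one pass. It is not the answerer's exact code (which relies on `replace_with`); the helper name `join_strong` and the compact `HTML` sample are assumptions made for this example, and it presumes `bs4` is installed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Compact version of the sample from the question (assumed for illustration).
HTML = (
    "<p><strong>This is </strong><strong>a lin</strong>"
    "<strong>e which I want to </strong><strong>join.</strong></p>"
    "<p>2. <strong>But do not </strong><strong>touch this</strong></p>"
)


def join_strong(soup):
    """Merge every <p> whose only content is two or more <strong> tags."""
    for p in soup.find_all("p"):
        # Direct children, ignoring whitespace-only strings between tags.
        kids = [
            c for c in p.contents
            if not (isinstance(c, NavigableString) and not c.strip())
        ]
        only_strong = all(isinstance(c, Tag) and c.name == "strong" for c in kids)
        if len(kids) > 1 and only_strong:
            merged = soup.new_tag("strong")
            merged.string = "".join(c.get_text() for c in kids)
            p.clear()
            p.append(merged)
    return soup


soup = join_strong(BeautifulSoup(HTML, "html.parser"))
print(soup)
```

The second paragraph is left alone because it contains the bare text "2." next to its strong tags.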
Answer: I have managed to solve my own problem:
    # Requires: from lxml import etree
    # text_content() is only available on elements parsed with lxml.html
    for p in self.tree.xpath('//body/p'):
        if p.tail is None:  # some conditions specifically for my doc
            children = p.getchildren()
            if len(children) > 1:
                for child in children:
                    # if anything else is present, break
                    if child.tag != 'strong' or child.tail is not None:
                        break
                else:
                    # No break: this is a p block to fix.
                    # Drop everything inside p and put a single SubElement in.
                    etree.strip_tags(p, 'strong')
                    tmp_text = p.text_content()
                    p.clear()
                    subtext = etree.SubElement(p, "strong")
                    subtext.text = tmp_text
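For readers who want to try the idea outside the original class, here is a minimal standalone sketch of the same approach. It deviates from the snippet above in ways that are assumptions for this example: it treats whitespace-only text between tags as ignorable (so a pretty-printed sample works), it also skips paragraphs that start with bare text, and it restores the tail that `p.clear()` resets. The names `is_blank`, `join_strong_runs`, and `SAMPLE` are hypothetical.

```python
from lxml import etree, html

SAMPLE = """
<html><body>
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
</body></html>
"""


def is_blank(text):
    """True for None or whitespace-only text."""
    return text is None or not text.strip()


def join_strong_runs(root):
    """Merge each <p> that consists solely of two or more <strong> children."""
    for p in root.xpath("//body/p"):
        children = list(p)
        if len(children) < 2 or not is_blank(p.text):
            continue
        if any(c.tag != "strong" or not is_blank(c.tail) for c in children):
            continue
        # Concatenate the children's text; whitespace-only tails are dropped.
        joined = "".join(c.text_content() for c in children)
        tail = p.tail          # clear() also resets the tail, so keep it
        p.clear()
        p.tail = tail
        etree.SubElement(p, "strong").text = joined
    return root


root = join_strong_runs(html.fromstring(SAMPLE))
first, second = root.xpath("//body/p")
print(first[0].text)                 # This is a line which I want to join.
print([c.tag for c in second])       # ['strong', 'strong', 'em']
```

Dropping the whitespace between tags suits this sample, where a word is split across two strong tags; in a document where that whitespace is meaningful, it would have to be kept.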
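Special thanks to @Scott, who helped me work through this problem. Although I can't mark his answer as correct, I appreciate his guidance all the same.

Answer: Alternatively, you can use a more specific XPath to get the target p elements directly: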
    p_target = """
    //p[strong]
       [not(*[not(self::strong)])]
       [not(text()[normalize-space()])]
    """
    for p in self.tree.xpath(p_target):
        # logic inside the loop can also be the same as your `else` block
        content = p.xpath("normalize-space()")
        p.clear()
        strong = etree.SubElement(p, "strong")
        strong.text = content
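A brief explanation of the XPath used:
//p[strong]: find p elements, anywhere in the XML/HTML document, that have a strong child element
[not(*[not(self::strong)])]: ...and have no child elements other than strong
[not(text()[normalize-space()])]: ...and have no non-empty text node children
normalize-space(): takes all the text nodes within the current context element and joins them, trimming leading and trailing whitespace and collapsing consecutive whitespace into a single space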
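The predicates described above can be checked on a small document. This is a sketch under stated assumptions: `SAMPLE` is a compact version of the question's HTML with no whitespace between the strong tags, and `P_TARGET` is the same expression written as one concatenated string. Note that if the tags were separated by newlines, `normalize-space()` would turn each gap into a single space, so "a lin" and "e" would not be joined into "a line".

```python
from lxml import etree, html

# Compact sample without inter-tag whitespace (assumed for illustration).
SAMPLE = (
    "<html><body>"
    "<p><strong>This is </strong><strong>a lin</strong>"
    "<strong>e which I want to </strong><strong>join.</strong></p>"
    "<p>2. <strong>But do not </strong><strong>touch this</strong>"
    "<em>Maybe some other tags as well.</em> bla bla blah...</p>"
    "</body></html>"
)

P_TARGET = (
    "//p[strong]"                        # has at least one <strong> child
    "[not(*[not(self::strong)])]"        # no child element other than <strong>
    "[not(text()[normalize-space()])]"   # no non-blank text node child
)

doc = html.fromstring(SAMPLE)
matches = doc.xpath(P_TARGET)
texts = [p.xpath("normalize-space()") for p in matches]
print(len(matches))   # 1
print(texts[0])       # This is a line which I want to join.

# Rebuild each matching paragraph around a single <strong> element.
for p in matches:
    content = p.xpath("normalize-space()")
    p.clear()
    etree.SubElement(p, "strong").text = content

paragraphs = doc.xpath("//body/p")
```

Only the first paragraph matches: the second one fails both the "no other child element" test (it has an em) and the "no non-blank text" test (it starts with "2.").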
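Comments:

Is this ... and so on, or is ... missing?

@Scott Thanks for pointing that out; yes, I missed it. I will edit my question.

@har07 Please see my update. Thank you very much.

Your code gave me quite a bit of intuition; please see my update. Unfortunately, I am not using BeautifulSoup (I should have mentioned that earlier). Once I have the reputation, I will definitely upvote your answer.

@lpoung Likewise, I don't use lxml, though it probably has a "replace tag" function. BeautifulSoup is just a quick pip install...

Good job! +1 Thanks for answering your own question. You can accept your own answer (click the check mark below the up/down arrows).

Great! I may need this in the future while digging into lxml. Well explained, too!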
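Code from the bs4 answer above:

    # `bs` is assumed to be: from bs4 import BeautifulSoup as bs
    # `soup` (the parsed document) and `s` (the replacement markup) are defined in code not shown here
    p_tag = soup.p
    p_tag.replace_with(bs(s, 'html.parser'))
    print(soup)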