Python: combining multiple tags with lxml


python html xpath


I have an HTML file that looks like this:

...
<p>  
    <strong>This is </strong>  
    <strong>a lin</strong>  
    <strong>e which I want to </strong>  
    <strong>join.</strong>  
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>
...

With my current code, I am able to strip the strong tags from the part I want, giving:

<p>
      This is a line which I want to join.  
</p>



So now I just need a way to put the tag back...

I was able to do this with bs4 (BeautifulSoup):

This prints:

<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>


I have managed to solve my own problem:

from lxml import etree

for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # a condition specific to my document
        children = list(p)  # getchildren() is deprecated in lxml
        if len(children) > 1:
            for child in children:
                # Any non-<strong> child, or text after a child, means
                # this <p> holds other content: leave it alone.
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # No break: every child is a bare <strong>, so fix this <p>.
                # Drop the <strong> wrappers, then re-wrap the text once.
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()  # needs an lxml.html element
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text

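Special thanks to @Scott, who helped me work through this problem. Although I cannot mark his answer as correct, I appreciate his guidance all the same.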

Alternatively, you can use a more specific XPath to select the target p elements directly:

p_target = """
//p[strong]
   [not(*[not(self::strong)])]
   [not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    #logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content

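A brief explanation of the XPath used:

`//p[strong]` : find `p` elements, anywhere in the XML/HTML document, that have a `strong` child element

`[not(*[not(self::strong)])]` : ...and that have no child elements other than `strong`

`[not(text()[normalize-space()])]` : ...and that have no non-empty text node children

`normalize-space()` : takes the string value of the current context element (all of its descendant text nodes concatenated), trims leading and trailing whitespace, and collapses each run of whitespace into a single space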
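The individual pieces can also be evaluated on their own to see what each one contributes. This is a small sketch on a made-up fragment, assuming `lxml.html`; the variable names are only illustrative.

```python
from lxml import html

p = html.fragment_fromstring(
    "<p>\n  <strong>join   these</strong>\n  <em>words</em>\n</p>"
)

# String value of <p>, trimmed, with whitespace runs collapsed.
normalized = p.xpath("normalize-space()")

# Child elements that are not <strong>; an empty list would mean
# the paragraph contains <strong> children only.
non_strong = p.xpath("*[not(self::strong)]")

# Text-node children that are not blank after normalization.
non_blank_text = p.xpath("text()[normalize-space()]")

print(normalized)                    # join these words
print([e.tag for e in non_strong])   # ['em']
print(non_blank_text)                # []
```

Here the `em` child makes the second predicate of the full expression fail, so this paragraph would not be selected.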

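Should the first block be ... This is ... and so on, or is ... missing?

@Scott Thanks for pointing that out; yes, I missed it. I will edit my question.

@har07 Please see my update. Thanks a lot.

Your code gave me quite a bit of intuition; please see my update. Unfortunately, I am not using BeautifulSoup (I should have mentioned that earlier). Once I have enough reputation, I will definitely upvote your answer.

@lpoung Likewise, I don't use lxml, though it may well have a "replace tag" feature. BeautifulSoup is just a quick pip install...

Good job, and +1 for answering your own question. You can accept your own answer (click the check mark below the up/down arrows).

Great! I may need this in the future as I dig deeper into lxml. Nicely explained, too!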
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)

