Python: Combining multiple tags with lxml

Tags: python, html, xpath, lxml

I have an HTML file that looks like this:

...
<p>  
    <strong>This is </strong>  
    <strong>a lin</strong>  
    <strong>e which I want to </strong>  
    <strong>join.</strong>  
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>
...

With my current code, I can strip the strong tags from the part I want, giving:

<p>
      This is a line which I want to join.  
</p>  


So now I just need a way to put the tag back…

I was able to do this with bs4 (BeautifulSoup); the result prints as:

<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
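
For readers who want something runnable, here is a minimal sketch of one way to produce output of that shape with BeautifulSoup. It is not the answerer's original code: the sample markup, the variable names, and the choice of `new_tag`, `clear`, and `append` are assumptions made for illustration, and it only handles the first p, as `soup.p` does.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (an assumption, not the asker's file).
html_doc = (
    "<html><body><p><strong>This is </strong><strong>a lin</strong>"
    "<strong>e which I want to </strong><strong>join.</strong></p>\n"
    "<p>\n<strong>But do not </strong>\n<strong>touch this</strong>\n</p>"
    "</body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# soup.p is the first <p> in the document.
p_tag = soup.p

# Concatenate the text of every <strong> inside that paragraph.
joined = "".join(s.get_text() for s in p_tag.find_all("strong"))

# Replace the paragraph's children with a single <strong> holding the joined text.
new_strong = soup.new_tag("strong")
new_strong.string = joined
p_tag.clear()
p_tag.append(new_strong)

print(soup)
```

Building the new tag with `new_tag` avoids re-parsing a string of markup, so text containing characters such as `<` or `&` is escaped on output rather than interpreted.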

I have managed to solve my own problem:

from lxml import etree  # text_content() below assumes self.tree was parsed with lxml.html

for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # a condition specific to my document
        children = list(p)
        if len(children) > 1:
            for child in children:
                # Stop if anything other than a bare <strong> is present
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # No break: this <p> holds only <strong> children and needs fixing.
                # Flatten the <strong> tags, then wrap the joined text in a single one.
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text

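Special thanks to @Scott, who helped me work through this problem. Although I cannot mark his answer as correct, I appreciate his guidance all the same.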
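
The loop above depends on `self.tree` and on conditions specific to the asker's document, so it cannot be run on its own. The following is a self-contained sketch of the same idea, under stated assumptions: the sample HTML is copied from the question, the helper names `is_blank` and `merge_strong_runs` are hypothetical, whitespace-only text between tags is treated as ignorable (the original requires the tails to be `None`), and the text of each strong is concatenated directly instead of going through `strip_tags` and `text_content()` on the paragraph.

```python
from lxml import etree, html

SAMPLE = """<html><body>
<p>
    <strong>This is </strong>
    <strong>a lin</strong>
    <strong>e which I want to </strong>
    <strong>join.</strong>
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>
</body></html>"""


def is_blank(text):
    """True for None or whitespace-only text."""
    return text is None or not text.strip()


def merge_strong_runs(root):
    """Collapse <p> elements made only of <strong> children into one <strong>.

    Returns the number of paragraphs that were rewritten.
    """
    merged = 0
    for p in root.xpath("//body/p"):
        children = list(p)
        # Need at least two children and no real text before the first one.
        if len(children) < 2 or not is_blank(p.text):
            continue
        # Every child must be a <strong> with nothing but whitespace after it.
        if any(c.tag != "strong" or not is_blank(c.tail) for c in children):
            continue
        joined = "".join(c.text_content() for c in children)
        tail = p.tail          # clear() also resets the tail, so keep it
        p.clear()
        p.tail = tail
        etree.SubElement(p, "strong").text = joined
        merged += 1
    return merged


root = html.fromstring(SAMPLE)
count = merge_strong_runs(root)
paragraphs = root.xpath("//body/p")
print(html.tostring(paragraphs[0], encoding="unicode"))
```

Note that `clear()` also removes any attributes on the p; if the paragraphs carry attributes that matter, they would need to be saved and restored in the same way as the tail.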

Alternatively, you can use a more specific XPath to get the target p elements directly:

p_target = """
//p[strong]
   [not(*[not(self::strong)])]
   [not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    #logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
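
Brief explanation of the XPath used:

  • //p[strong]: find p elements, anywhere in the XML/HTML document, that have a strong child element
  • [not(*[not(self::strong)])]: ...and that have no child elements other than strong
  • [not(text()[normalize-space()])]: ...and that have no non-empty text node children
  • normalize-space(): take all text nodes under the current context element, concatenated, with leading and trailing whitespace removed and each run of whitespace collapsed to a single space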
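Comments:

Should the first block be "This is" etc., or is something missing?

@Scott Thanks for pointing that out; yes, I missed it. I will edit my question.

@har07 Please see my update. Thank you very much.

Your code gave me quite a bit of intuition; please see my update. Unfortunately, I am not using BeautifulSoup (I should have mentioned that earlier). Once I have the reputation, I will certainly upvote your answer.

@lpoung Likewise, I don't use lxml, though it probably has a "replace tag" feature. BeautifulSoup is just a quick pip install…

Nice work, +1, and thanks for answering your own question. You can accept your own answer (click the check mark below the up/down arrows).

Great! I might need this in the future as I dig deeper into lxml. Well explained, too!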
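
# Fragment of the BeautifulSoup answer's code. Defined in the part not shown:
# `soup` (the parsed document), `bs` (presumably the BeautifulSoup class), `s` (the replacement markup).
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)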