How do I reference the parent element and delete it from RSS XML using lxml in Python?


I have been struggling to crack this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>

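My attempt removes only the second description tag, which makes sense, but I want the whole item gone. If all I have is the 'desc' reference, I don't know how to get hold of the 'item' element.

I have tried Googling and searching here, but the cases I find only want to remove a tag the way I do now. Oddly, I haven't come across sample code that removes the whole parent. Any pointers to documentation, tutorials, or other help are very welcome.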
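As a minimal sketch of what the question asks for: in lxml every element exposes getparent(), so a 'desc' reference is enough to reach its 'item', and the item's own parent can then remove it. The feed below is a trimmed-down, hypothetical stand-in for the real one, and remove_items_containing is an illustrative helper name, not something from the original post.

```python
from lxml import etree

# Hypothetical, trimmed-down feed used only for this illustration.
FEED = """<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <item><title>keep</title><description>harmless text</description></item>
    <item><title>drop</title><description>link</description><description>some forbidden text</description></item>
  </channel>
</rss>"""


def remove_items_containing(root, needle):
    """Remove every <item> that has a <description> whose text contains needle."""
    removed = 0
    # Build the list first so that removing elements does not disturb iteration.
    for desc in list(root.iter("description")):
        item = desc.getparent()  # the element that owns this <description>
        if item is None or item.tag != "item":
            continue  # e.g. a channel-level <description>
        if needle in (desc.text or ""):
            channel = item.getparent()
            # channel is None if the item was already removed
            # because of an earlier matching description.
            if channel is not None:
                channel.remove(item)
                removed += 1
    return removed


root = etree.fromstring(FEED)
count = remove_items_containing(root, "forbidden")
print(count, [t.text for t in root.iter("title")])
```

The same loop is a reasonable place for more involved checks on the description text, since the decision is made in plain Python before remove() is called.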

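Consider XSLT, a special-purpose language designed for transforming XML files, for example conditionally removing nodes by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script into the XSLT, much like passing parameters into SQL. This way you avoid any for loops or if logic, and you avoid rebuilding the tree in the application layer.

XSLT (save as an .xsl file, which is a special kind of .xml file)

Python demo (runs two searches against the posted sample)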

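I like XSLT a lot, but another option is to select the item rather than the description: select the element you want to remove, not its child.

Also, if you use XPath, you can check for the forbidden string directly in the XPath predicate.

For example: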

from lxml import etree

testString = """
<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))

for item in found:
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))

This prints:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description/>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>...</item>
    </channel>
</rss>
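One caveat with formatting the string into the expression: a forbidden string that itself contains a double quote would break the XPath. A possible variation, sketched here on a hypothetical two-item feed, passes the string as an XPath variable through a keyword argument of xpath(). The variable name needle is an arbitrary choice for this example.

```python
from lxml import etree

# Hypothetical feed used only for this illustration.
FEED = """<rss version="2.0"><channel>
  <item><description>plain</description></item>
  <item><description>He said "stop" here</description></item>
</channel></rss>"""

doc = etree.fromstring(FEED)
forbidden_string = 'said "stop"'  # contains double quotes on purpose

# $needle is bound through the keyword argument, so no manual quoting is needed.
found = doc.xpath(
    ".//channel/item[description[contains(., $needle)]]",
    needle=forbidden_string,
)
for item in found:
    item.getparent().remove(item)

remaining = [d.text for d in doc.iter("description")]
print(remaining)
```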

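Because you apply templates only to the items in the context of channel, you lose all of channel's other children, such as title, description, and link. What I would do is remove the template matching channel and add a template matching item. Since XSLT 1.0 does not allow parameters or variables to be referenced in a match pattern, I would add an xsl:if whose test is not(description[contains(., $search_string)]) (there is no need to test the position of the description) and, if it is true, output the item (xsl:copy with xsl:apply-templates, to keep the push style).

Besides checking for forbidden strings, I have to do more logic on the text in the description tags. But your trick of working from the item element put me on the right track: I selected the item elements, got a ChildElementIterator, applied my logic, and could then remove the item just like in your example. Thanks!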
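The suggestion about matching item instead of channel could look roughly like the sketch below. The stylesheet is an assumption written for this illustration, not the one from the answer: an identity template copies everything, and a second template copies an item only when none of its descriptions contains $search_string. The feed is a shortened, hypothetical stand-in.

```python
from lxml import etree

# Hypothetical stylesheet: identity transform plus a conditional <item> template.
XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="search_string"/>

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an item only if none of its descriptions contains the search string -->
  <xsl:template match="item">
    <xsl:if test="not(description[contains(., $search_string)])">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>"""

# Hypothetical feed used only for this illustration.
FEED = """<rss version="2.0"><channel>
  <title>My RSS Feed</title>
  <item><title>keep</title><description>harmless</description></item>
  <item><title>drop</title><description>a bunch of text</description></item>
</channel></rss>"""

transform = etree.XSLT(etree.fromstring(XSL))
result = transform(
    etree.fromstring(FEED),
    search_string=etree.XSLT.strparam("bunch of text"),
)
titles = [t.text for t in result.getroot().iter("title")]
print(titles)
```

Note that the parameter should always be supplied: with an empty search string, contains() is true for every description, so every item that has a description would be dropped.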

import lxml.etree as et

# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)

print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>
#       <guid/>
#       <pubDate/>
#       <author/>
#       <title>Title of the item</title>
#       <link href="https://example.com" rel="alternate" type="text/html"/>
#       <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
#       <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
#     </item>
#     <item>...</item>
#   </channel>
# </rss>

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)

print(result)    
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#   </channel>
# </rss>

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
