How do I reference a parent element and remove it from RSS XML using lxml in Python?

I have been struggling with this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
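What I have so far removes only the second description tag, which makes sense, but I want the whole item gone. If all I have is the 'desc' reference, I don't know how to get to the 'item' element.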

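I have tried Googling and searching here, but the cases I found only want to remove the tag itself, as my code does now. Oddly, I have not come across sample code that removes the whole parent element. Any pointers to documentation, tutorials, or other help are very welcome.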

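Consider XSLT, a special-purpose language designed for transforming XML files, for example conditionally removing nodes by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script into the XSLT, much like passing parameters into SQL. This way you avoid any for loops or if logic, and you avoid rebuilding the tree at the application layer.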
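As a minimal sketch of the parameter passing described above: the stylesheet below is illustrative only (it is not the answer's original stylesheet), and the names `STYLESHEET`, `FEED`, and `search_string` are assumptions made for this example. It only counts the matching items, to show that the Python value reaches the XSLT.

```python
from lxml import etree

# Illustrative stylesheet (not the answer's original): it declares a
# top-level parameter and reports how many items mention it.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="search_string"/>
  <xsl:template match="/">
    <xsl:value-of select="count(//item[description[contains(., $search_string)]])"/>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical two-item feed, reduced from the sample in the question.
FEED = b"""\
<rss version="2.0">
  <channel>
    <item><description>keep me</description></item>
    <item><description>drop me, please</description></item>
  </channel>
</rss>
"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))
doc = etree.fromstring(FEED)

# strparam() quotes the Python string so that XSLT receives a string
# value instead of evaluating it as an XPath expression.
result = transform(doc, search_string=etree.XSLT.strparam("drop me"))
matches = str(result)
print(matches)
```

Without `strparam()`, a plain keyword value would be interpreted as an XPath expression, which is rarely what you want for free text.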

XSLT (save as an .xsl file, which is a special kind of .xml file)

Python demo (runs two searches against the posted sample)


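I like XSLT a lot, but another option is simply to select the item instead of the description: select the element you want to delete, not its child.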
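A rough sketch of that idea, assuming you need more logic than a simple substring test. The helper `should_drop` and the reduced feed are hypothetical, made up for this illustration. It also shows the step the question asks about: `getparent()` walks from a description up to its item.

```python
from lxml import etree

# Hypothetical feed, reduced from the sample in the question.
FEED = b"""<rss version="2.0"><channel>
<title>My RSS Feed</title>
<item><title>A</title><description>harmless text</description></item>
<item><title>B</title><description>link</description><description>contains a bunch of text</description></item>
</channel></rss>"""


def should_drop(item):
    """Hypothetical rule: any logic over the item's descriptions goes here."""
    texts = [(d.text or "").lower() for d in item.findall("description")]
    return any("bunch of text" in t for t in texts)


root = etree.fromstring(FEED)

# findall() returns a list, so removing elements while looping is safe.
for item in root.findall("./channel/item"):
    if should_drop(item):
        item.getparent().remove(item)

# Going the other way: from a description reference up to its item.
desc = root.find("./channel/item/description")
parent_tag = desc.getparent().tag

remaining = [i.findtext("title") for i in root.iter("item")]
print(parent_tag, remaining)
```

Looping over the items rather than the descriptions means the element in hand is already the one to remove, so only a single `getparent()` call is needed to reach the channel.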

Also, if you use XPath, you can test for the forbidden string directly in the XPath predicate.

For example:

from lxml import etree

testString = """
<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))

for item in found:
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))
This prints:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description/>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>...</item>
    </channel>
</rss>
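One caveat with building the XPath through `str.format()`: a search string that itself contains double quotes would break the expression. A possible alternative, sketched here with a hypothetical feed and the made-up variable name `$needle`, is to let lxml bind the value as an XPath variable.

```python
from lxml import etree

# Hypothetical feed; the second description contains double quotes.
FEED = b"""<rss version="2.0"><channel>
<item><title>A</title><description>plain</description></item>
<item><title>B</title><description>He said "drop it" twice</description></item>
</channel></rss>"""

root = etree.fromstring(FEED)

# This value contains double quotes, which would break an XPath string
# literal built with str.format().
forbidden_string = 'said "drop it"'

# lxml binds keyword arguments to XPath variables ($needle here).
found = root.xpath(
    "./channel/item[description[contains(., $needle)]]",
    needle=forbidden_string,
)

for item in found:
    item.getparent().remove(item)

titles = root.xpath("./channel/item/title/text()")
print(titles)
```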

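Because you only apply templates to the items in the context of channel, you lose all of channel's other children, such as title, description, and link. What I would do is remove the template that matches channel and add a template that matches item. Since XSLT 1.0 does not allow parameters or variables in a match pattern, I would add an xsl:if whose test is not(description[contains(., $search_string)]), which does not depend on the position of the description. If the test is true, output the item (xsl:copy with xsl:apply-templates, to keep it push style).

Besides checking for the forbidden string, I have to apply more logic to the text in the description tags. But your trick of working with the item element put me on the right track: I took the item element, got its ChildElementIterator, applied my logic, and could then remove the item just as in your example! Thanks!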
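The stylesheet suggested in the first comment could look roughly like this. It is a sketch written from the comment's description, not the answer's original .xsl file; the embedded stylesheet, the reduced feed, and the variable names are assumptions for illustration.

```python
from lxml import etree

# Sketch of the suggested stylesheet: identity transform plus an item
# template guarded by xsl:if. Not the answer's original file.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="search_string"/>

  <!-- Identity transform: copy every node and attribute as-is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an item only if none of its descriptions contain the parameter. -->
  <xsl:template match="item">
    <xsl:if test="not(description[contains(., $search_string)])">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical feed, reduced from the sample in the question.
FEED = b"""\
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <item><title>A</title><description>harmless</description></item>
    <item><title>B</title><description>a bunch of text here</description></item>
  </channel>
</rss>
"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(
    etree.fromstring(FEED),
    search_string=etree.XSLT.strparam("bunch of text"),
)

out = result.getroot()
channel_title = out.findtext("channel/title")
item_titles = [i.findtext("title") for i in out.iter("item")]
print(channel_title, item_titles)
```

Because everything except a matching item falls through to the identity template, the channel's title, link, and description survive the transformation.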
Python demo for the XSLT answer:

import lxml.etree as et

# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)

print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>
#       <guid/>
#       <pubDate/>
#       <author/>
#       <title>Title of the item</title>
#       <link href="https://example.com" rel="alternate" type="text/html"/>
#       <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
#       <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
#     </item>
#     <item>...</item>
#   </channel>
# </rss>

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)

print(result)    
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#   </channel>
# </rss>

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)