How do I reference a parent element with lxml in Python and remove that parent from an RSS XML file?
I've been struggling to crack this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
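It only removes the second description tag, which makes sense, but I want the whole item gone.
If all I have is the 'desc' reference, I don't know how to get to the 'item' element.
I've tried googling and searching here, but the cases I found only want to remove a tag the way I'm doing it now. Oddly, I haven't stumbled on sample code that removes the whole parent element.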
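Any pointers to documentation, tutorials, or other help are very welcome.
Consider XSLT, a special-purpose language designed to transform XML files, for example to remove nodes conditionally by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script into the XSLT, much like passing parameters in SQL. This way you avoid any for loops or if logic, and you avoid rebuilding the tree at the application layer.
XSLT (save as an .xsl file, which is a special kind of .xml file)
Python (the demo runs two searches against the posted sample)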
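I'm a big fan of XSLT, but another option is to select the item instead of the description: select the element you want to delete, not its child. Also, if you use XPath, you can check for the forbidden string directly in the XPath predicate. For example: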
from lxml import etree
testString = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
"""
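forbidden_string = "I want to get rid of the whole item"
# strip_cdata=False keeps the CDATA sections intact when serializing
parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
# Select the <item> elements themselves, filtered by the text of their <description> children
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))
for item in found:
    # An element is removed through its parent
    item.getparent().remove(item)
print(etree.tostring(doc, encoding="unicode", pretty_print=True))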
This prints:
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description/>
<item>...</item>
<item>...</item>
<item>...</item>
<item>...</item>
</channel>
</rss>
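When the decision needs more than a single substring test, the same getparent() idea works with ordinary Python logic instead of an XPath predicate. The following is only a minimal sketch: the trimmed sample feed, the FORBIDDEN list, and the should_drop and drop_matching_items helpers are hypothetical names invented for illustration, not part of the original post.

```python
from lxml import etree

# Hypothetical sample feed, trimmed from the one in the question.
SAMPLE = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<item><title>keep me</title><description><![CDATA[<p>Harmless text.</p>]]></description></item>
<item><title>drop me</title><description><![CDATA[<p>This contains a bunch of text.</p>]]></description></item>
</channel>
</rss>
"""

# Hypothetical phrases that should cause an item to be dropped.
FORBIDDEN = ["bunch of text", "another phrase"]


def should_drop(text):
    """Placeholder for arbitrary logic on the description text."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in FORBIDDEN)


def drop_matching_items(root):
    """Remove every <item> whose <description> text is flagged; return the count."""
    removed = 0
    # Build the list first so that removing elements does not disturb iteration.
    for item in list(root.iter("item")):
        texts = ("".join(d.itertext()) for d in item.iterfind("description"))
        if any(should_drop(t) for t in texts):
            # Walk up from the item to its parent and remove the item there.
            item.getparent().remove(item)
            removed += 1
    return removed


root = etree.fromstring(SAMPLE.strip())
removed_count = drop_matching_items(root)
remaining_titles = [t.text for t in root.iterfind("channel/item/title")]
print(removed_count, remaining_titles)
```

If the starting point is a description element rather than an item, calling getparent() on that description returns the enclosing item, and calling getparent() again on the item returns the channel it has to be removed from.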
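Since you only apply templates to item in the context of channel, you lose all of channel's other children, such as title, description, and link. What I'd do is remove the template that matches channel and add a template that matches item. Because XSLT 1.0 does not let you reference a parameter or variable in a match pattern, I'd add an xsl:if with the test not(description[contains(., $search_string)]), which does not depend on the position of the description. If the test is true, output the item with xsl:copy and xsl:apply-templates to keep the push style.
I had to do more logic on the text inside the description tag than just checking for forbidden strings. But your trick of working with the item element put me on the right track: I took the item element, got the ChildElementIterator, applied my logic, and could then remove the item just like in your example. Thanks!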
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>
# <guid/>
# <pubDate/>
# <author/>
# <title>Title of the item</title>
# <link href="https://example.com" rel="alternate" type="text/html"/>
# <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
# <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
# </item>
# <item>...</item>
# </channel>
# </rss>
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# </channel>
# </rss>
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
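The sample outputs in the demo show channel losing its title, link, and description. A stylesheet along the lines suggested in the comments (an identity transform plus a template matching item with an xsl:if test) might look like the sketch below. This is an assumed stylesheet written for illustration, not the one from the answer, and the small FEED string is a hypothetical stand-in for Input.xml.

```python
from lxml import etree

# Assumed stylesheet, not the one from the answer: an identity transform plus
# an <item> template that copies the item only when no description matches.
XSLT_SOURCE = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="search_string"/>

  <!-- Identity template: copies every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an item only if none of its descriptions contain the parameter. -->
  <xsl:template match="item">
    <xsl:if test="not(description[contains(., $search_string)])">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical stand-in for Input.xml.
FEED = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<item><title>first</title><description>plain text</description></item>
<item><title>second</title><description>a bunch of text here</description></item>
</channel>
</rss>
"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE.strip()))
doc = etree.fromstring(FEED.strip())

# strparam quotes the Python string so it is passed as an XPath string value.
result = transform(doc, search_string=etree.XSLT.strparam("bunch of text"))
root = result.getroot()
kept_titles = [t.text for t in root.findall("channel/item/title")]
channel_title = root.findtext("channel/title")
print(kept_titles, channel_title)
```

One caveat with this test: in XPath 1.0, contains() is true for an empty search string, so passing an empty parameter would drop every item that has a description.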