How to remove tag content with Python and lxml using lxml.html.document_fromstring()

python xml

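I want to remove the content that sits between <pre><code> tags. I looked at remove(), strip_elements(), and the BeautifulSoup methods, but every example I found handles only a single tag, whereas I want to delete everything between <pre><code> and </code></pre>, including the tags themselves.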

    from lxml import etree

    cur.execute('SELECT Title, Body FROM posts')
    for item in cur:
        record = list(item)
        doc = etree.fromstring(record[1])  # error thrown here
        for node in doc.xpath('pre[code]'):
            doc.remove(node)
        record[1] = etree.tostring(doc)
        page = lxml.html.document_fromstring(record[1])
        record[0] = str(record[0])
        record[1] = str(page.text_content())  # Stripping HTML Tags
        print record[1]
Edited code: my code is given above, but it throws the following error at doc = etree.fromstring(record[1]):

lxml.etree.XMLSyntaxError: Extra content at the end of the document

Update: I now understand that the markup I am working with is not well-formed XML, so I need to use lxml.html.document_fromstring() to remove the tag content instead of etree.fromstring(). I could not find any example that uses lxml.html.document_fromstring() for this. Can someone show me how to remove the content of the tags with it?

So if I input something, some code, some text, what output do you expect?

@roippi My only goal is this: if I have something like <pre><code> ... </code></pre>, everything in between should be removed, including the tags.

So you want to nuke the entire contents of any pre tag that has a direct code child.

Yes. That is exactly how formatted code blocks appear in Stack Overflow posts, and I want to remove those code blocks from the posts.

Do you mean the preformatted-text tag, <pre>?
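The thread shows no worked example, so here is a minimal sketch of one possible approach, written for Python 3. The error most likely comes from a post body holding several top-level elements, which a strict XML parser rejects; the HTML parser behind lxml.html.document_fromstring() tolerates that. The helper name strip_code_blocks and the sample body are hypothetical and exist only for illustration. Because the parsed fragment ends up inside html/body, the sketch uses the XPath '//pre[code]' rather than the relative 'pre[code]' from the question, and it calls drop_tree() instead of doc.remove(node) so that text following a removed block is kept.

```python
import lxml.html


def strip_code_blocks(body):
    """Return the plain text of an HTML fragment without its <pre><code> blocks.

    Hypothetical helper for illustration; not part of lxml.
    """
    # document_fromstring() uses the lenient HTML parser and wraps the
    # fragment in <html><body>, so several top-level elements are fine.
    page = lxml.html.document_fromstring(body)

    # '//pre[code]' selects every <pre> that has a direct <code> child,
    # at any depth. list() takes a snapshot before the tree is modified.
    for node in list(page.xpath('//pre[code]')):
        # drop_tree() removes the element and its children but keeps the
        # tail text that follows the element.
        node.drop_tree()

    return str(page.text_content())


# Made-up post body, assumed to resemble a Stack Overflow post.
sample = '<p>something</p><pre><code>some code</code></pre><p>some text</p>'
print(strip_code_blocks(sample))
```

In the loop from the question, this would stand in for the etree.fromstring() and doc.remove() steps, for example record[1] = strip_code_blocks(record[1]). document_fromstring() raises an error on an empty string, so rows with an empty Body would need a guard first.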