How to use ElementTree to parse HTML tags as raw text

html xml python-3.x


I have a file that contains HTML inside an XML tag, and I want that HTML as raw text rather than having it parsed as children of the XML tag. Here is an example:

import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")

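root.find('text').text returns no output, but root.find('text/p').text returns the paragraph text without the tags. I want everything inside the text tag as raw text, but I can't figure out how to get it.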
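To make the behaviour described above concrete, here is a minimal sketch using the same sample document. The variable name text_el is only illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><text><p>This is some text that I want to read</p></text></root>"
)

text_el = root.find("text")

# <text> has no character data of its own before its first child,
# so its .text attribute is None.
print(text_el.text)  # None

# The paragraph's own text is available, but the <p> tags are gone.
print(root.find("text/p").text)  # This is some text that I want to read

# The <p> markup was parsed into a child element of <text>.
print([child.tag for child in text_el])  # ['p']
```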

I was able to get what I wanted by using ET.tostring to append all of the text tag's child elements to a string:

output_text = ""
for child in root.find('text'):
    output_text += ET.tostring(child, encoding="unicode")

>>> output_text
'<p>This is some text that I want to read</p>'


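That is reasonable. An Element object behaves like a list of its child elements. An element's .text attribute only covers the content (usually text) that does not belong to other (nested) elements.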
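As an illustration of that point, the sketch below inspects a small mixed-content element. The input string is a made-up example, not taken from the question. It shows that .text holds only the text before the first child, and that text following a child is stored on that child's .tail attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content input (not from the question).
elem = ET.fromstring("<text>Intro <p>first</p> middle <b>bold</b> end</text>")

# Text that appears before the first child element.
print(repr(elem.text))  # 'Intro '

# Each child keeps its own text, plus the text that follows it (.tail).
for child in elem:
    print(child.tag, repr(child.text), repr(child.tail))
# p 'first' ' middle '
# b 'bold' ' end'

# The element supports list-like access to its children.
print(len(elem), elem[0].tag)  # 2 p
```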

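A few things in the code could be improved. In Python, repeatedly concatenating strings in a loop can be an expensive operation. It is usually better to build a list of substrings and join them afterwards, like this: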

output_lst = []  
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))

output_text = ''.join(output_lst)

The list can also be built with a Python list comprehension, so the code becomes:

output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]  
output_text = ''.join(output_lst)

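.join accepts any iterable that produces strings, so the list does not need to be built in advance. Instead, you can use a generator expression (that is, the expression you would see inside the [] of a list comprehension):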

output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))

The one-liner can be split across several lines to make it more readable:

output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
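The generator-based version can be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original answer: the function name inner_markup is hypothetical, and it additionally handles two cases the loop above does not cover, namely text that precedes the first child and a find() call that matches nothing.

```python
import xml.etree.ElementTree as ET


def inner_markup(parent):
    """Return the content of `parent` as a string, child tags included."""
    if parent is None:  # find() returns None when nothing matches
        return ""
    leading = parent.text or ""  # text before the first child, if any
    # ET.tostring() serializes each child together with its tail text.
    return leading + "".join(
        ET.tostring(child, encoding="unicode") for child in parent
    )


root = ET.fromstring(
    "<root><text><p>This is some text that I want to read</p></text></root>"
)
print(inner_markup(root.find("text")))
# <p>This is some text that I want to read</p>
print(repr(inner_markup(root.find("missing"))))  # ''
```

Note that the leading text is returned as parsed, so characters such as & are not re-escaped in this sketch.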


Yes, but what about the answer I provided? It looks pretty simple, doesn't it?

Sorry, I think my original request wasn't very clear. I want the "<p>" tags (or any other html tags) to be included in the output text string, not only the tags' inner text.