What is the proper way to read and write HTML/XML (byte strings) with Python, lxml and etree?


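Edit: Now that the problem is solved, I realize it has more to do with reading and writing byte strings correctly than with HTML. Hopefully this makes the answer easier for others to find.

I have a badly formatted HTML file. I want to use a Python library to tidy it up.

It seems like it should be as simple as the following: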

import sys
from lxml import etree, html

#read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    #read the whole file into a single string
    file_text = ''.join(file.readlines())

#format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)

#write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    #write the pretty XML to a file
    file.write(document)

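But this chunk of code complains that file_lines is not a string or bytes-like object. Okay, I suppose it makes sense that the function can't take a list.

But then the result is 'bytes', not a string. No problem, I thought: str(document).

But then I get HTML that is full of '\n' sequences that are not newlines: each one is a backslash followed by an n. There are no actual line breaks in the result, just one long line.

I've tried a number of other odd things, like specifying the encoding and trying to decode, but none of them produced the desired result.

What is the proper way to read and write this kind of text (is non-ASCII the correct term)?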

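You are missing that etree's tostring method returns bytes, and you need to take that into account when writing (a bytestring) to a file. Use the b switch in the open function like this and forget about the str() conversion: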

with open('Pretty.html', 'wb') as file:
    #write the pretty XML to a file
    file.write(document)
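To see why str() gave one long line full of literal '\n' sequences, here is a minimal sketch. It uses a hand-written bytes literal as a stand-in for the output of etree.tostring (an assumption made so the snippet runs without lxml); the variable names are illustrative only.

```python
# Stand-in for the bytes object that etree.tostring() returns by default.
document = b"<html>\n  <body>\n    <p>Hello</p>\n  </body>\n</html>\n"

# str() on bytes does not decode anything: it returns the repr, which starts
# with b' and shows every newline as a backslash followed by the letter n.
as_repr = str(document)

# decode() turns the bytes into real text, with real newline characters.
as_text = document.decode("utf-8")

print(as_repr)
print(as_text)
```

Writing as_text to a file opened in 'w' mode, or writing document itself to a file opened in 'wb' mode, should both give a properly broken file; only the str() route produces the b'...' wrapper.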

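Addendum

Even though this answer solves the immediate problem and introduces bytestrings, the approach in the other answer (calling write on the parsed tree) is the cleaner and faster way to write lxml etrees to a file.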


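This can be done with lxml in a couple of lines of code without ever needing to use open. The write method does exactly what you are trying to do: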

from lxml import html

# parse using the file name, which is also the recommended way
tree = html.parse("C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html")
# call write on the tree
tree.write("C:/Users/mhurley/Portable_Python/notebooks/Pretty.html", pretty_print=True, encoding="utf-8")
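For anyone who wants to try the parse-then-write approach without touching real files, the following is a small round-trip sketch. The temporary directory, the file names ugly.html and pretty.html, and the sample markup are all made up for illustration; only html.parse and the tree's write method come from the answer.

```python
import os
import tempfile

from lxml import html

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names, used only for this demonstration.
    source_path = os.path.join(tmp_dir, "ugly.html")
    pretty_path = os.path.join(tmp_dir, "pretty.html")

    # Create a one-line, unformatted HTML file to work on.
    with open(source_path, "wb") as f:
        f.write(b"<html><body><div><p>Hello</p><p>World</p></div></body></html>")

    # lxml opens and reads the source file itself.
    tree = html.parse(source_path)
    # lxml also opens the target file and writes the encoded bytes itself.
    tree.write(pretty_path, pretty_print=True, encoding="utf-8")

    # Read the result back as bytes to inspect what was written.
    with open(pretty_path, "rb") as f:
        written = f.read()

print(written.decode("utf-8"))
```

Because write handles the encoding and the file handle, there is no point at which a bytes object could be accidentally passed through str().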

Also,

file_text = ''.join(file.readlines())

is the same as

file_text = file.read()
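A quick way to convince yourself of that equivalence is the sketch below. It uses io.StringIO as an in-memory stand-in for an opened text file (an assumption made to keep the snippet self-contained), and the sample text is made up.

```python
import io

sample = "first line\nsecond line\nlast line without newline"

# readlines() keeps each line's trailing newline, so joining with an
# empty string rebuilds the original text exactly.
joined = "".join(io.StringIO(sample).readlines())

# read() returns the whole content in a single call.
whole = io.StringIO(sample).read()

print(joined == whole)
```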


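I noticed that too, but the question remains: how do I keep it from "getting funny" when I write it to a file?

Great! Thanks. I didn't know 'wb' was a valid mode for writing files. This works great... it probably also explains why my output file started with a 'b' and was wrapped entirely in a set of single quotes. I thought that was odd, but I didn't think it meant anything.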