Python: how to prevent lxml from removing the doctype (tags: python, beautifulsoup, lxml)

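First, some background. I want a custom Html class in which I can prettify HTML and do other things (not shown in the code below).

I really like the lxml library, and if I knew how to prettify HTML properly with custom indentation using it, I wouldn't even consider BeautifulSoup. Unfortunately I don't, so I came up with this small piece of slow and obscure code: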

import lxml.html
from bs4 import BeautifulSoup


def write_new_line(line, current_indent, indent):
    new_line = ""
    spaces_to_add = (current_indent * indent) - current_indent
    if spaces_to_add > 0:
        for i in range(spaces_to_add):
            new_line += " "
    new_line += str(line) + "\n"
    return new_line


def prettify_html(content, indent=4):
    soup = BeautifulSoup(content, "html.parser")
    pretty_soup = str()
    previous_indent = 0
    for line in soup.prettify().split("\n"):
        current_indent = str(line).find("<")
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_soup += write_new_line(line, current_indent, indent)
    return pretty_soup.strip()


class Html:
    def __init__(self, string_or_html):
        if isinstance(string_or_html, str):
            self.html = lxml.html.fromstring(string_or_html)
        else:
            self.html = string_or_html

    def __str__(self):
        return prettify_html(lxml.html.tostring(self.html).decode("utf-8"), indent=4)


if __name__ == "__main__":
    import textwrap

    html = textwrap.dedent(
        """
        <!DOCTYPE html>
        <html lang="en">
            <head>
            </head>
            <body>
            </body>
        </html>
    """
    ).strip()

    print("broken_code".center(80, "-"))
    print(Html(html))

    print("good_code".center(80, "-"))
    print(prettify_html(html))

Although I don't fully understand your point, here is an implementation that may be what you need:

class Html:
    def __init__(self, string_or_html):
        if isinstance(string_or_html, str):
            self.html = lxml.html.fromstring(string_or_html)
        else:
            self.html = string_or_html

    def __str__(self):
        doctype = self.html.getroottree().docinfo.doctype
        return lxml.html.tostring(self.html, pretty_print=True, encoding="unicode", doctype=doctype)
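
To see why passing `doctype` helps, here is a minimal sketch comparing the two serializations. It assumes the input is a complete document that starts with a doctype; the variable names are only illustrative. The doctype is stored on the tree's `docinfo`, not on the root element, so serializing the element alone leaves it out.

```python
import lxml.html

source = '<!DOCTYPE html>\n<html lang="en"><head></head><body></body></html>'

# fromstring() returns the root <html> element for a full document.
root = lxml.html.fromstring(source)

# Serializing only the element drops the doctype.
without_doctype = lxml.html.tostring(root, encoding="unicode")

# The doctype is kept on the owning tree; pass it back in explicitly.
doctype = root.getroottree().docinfo.doctype
with_doctype = lxml.html.tostring(root, encoding="unicode", doctype=doctype)

print(without_doctype)
print(with_doctype)
```

For input without a doctype (for example an HTML fragment), check what `docinfo.doctype` contains before relying on it.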

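If you end up calling BeautifulSoup anyway, why do you need to use the two separately? Why not just use soup = BeautifulSoup(content, "lxml")?

That's really interesting, I didn't even know that was possible. As I mentioned in the question, the reason for using lxml is that I want to use it for other tasks, which would get rid of the BeautifulSoup dependency while still being able to prettify HTML (with proper indentation). That is the main goal here. I'll read up on it to understand what it actually does, although judging by the name it seems BS would somehow use lxml as its backend parser.

That's really interesting. I'll test it later, but judging by the names it looks like it will preserve the original HTML content? That would be great. Do you know how to prettify the string without using BeautifulSoup? Anyway, as I said, I'll test it later. In the meantime, +1 (assuming it keeps the original doctype).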
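
On the question of custom indentation without BeautifulSoup, one possible approach (not from the original answer) is `lxml.etree.indent`, available in lxml 4.5 and later. The sketch below assumes a complete HTML document as input and assumes that rewriting whitespace-only text between tags is acceptable, which can affect rendering of inline content. The function name `prettify_html` just mirrors the one in the question.

```python
import lxml.html
from lxml import etree


def prettify_html(content, indent=4):
    """Re-indent a full HTML document using lxml only (sketch)."""
    root = lxml.html.fromstring(content)
    # Rewrites whitespace-only text and tails in place, `indent` spaces per level.
    etree.indent(root, space=" " * indent)
    # The doctype is stored on the tree, so pass it back in when serializing.
    doctype = root.getroottree().docinfo.doctype
    return lxml.html.tostring(root, encoding="unicode", doctype=doctype)


page = (
    '<!DOCTYPE html><html lang="en"><head><title>t</title></head>'
    "<body><p>hi</p></body></html>"
)
pretty = prettify_html(page)
print(pretty)
```

Because `etree.indent` only touches text that is empty or whitespace-only, element content such as the `hi` inside `<p>` is left as it is.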