How to prevent Python BeautifulSoup from replacing escape sequences with hex codes?

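I am trying to use BeautifulSoup in a Python script that saves me the manual work of mass updates in IBM IDA (InfoSphere Data Architect) ldm (logical data model) files, which are actually XML. It works well for me apart from some side effects. The description attribute in the XML can contain formatting, with control characters encoded as escape sequences such as &#xD;, &#xA; and &#x9;. When my script writes the output, they are converted to the raw characters with hex codes 0D, 0A and 09. I do not know how to avoid this. To illustrate the effect, I simplified the script so that it only reads the model and writes it to another file: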

from bs4 import BeautifulSoup

source_model_file_name = "TestModel.ldm"
target_model_file_name = "TestModel_out.ldm"

# newline="\r\n" on read returns the CRLF line endings untranslated
with open(source_model_file_name, "r", encoding="utf-8", newline="\r\n") as source_model_file:
    source_model = source_model_file.read()

# Parse as XML (the "xml" parser requires lxml)
soup_model = BeautifulSoup(source_model, "xml")

# newline="\r\n" on write translates each "\n" to CRLF
with open(target_model_file_name, "w", encoding="utf-8", newline="\r\n") as file:
    file.write(str(soup_model))

One solution is to use a custom formatter:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


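class CustomAttributes(HTMLFormatter):
    # Called when a tag is serialized; yields the (name, value) pairs to write
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            # Turn the decoded control characters back into character references
            v = v.replace("\r", "&#xD;")
            v = v.replace("\n", "&#xA;")
            v = v.replace("\t", "&#x9;")
            yield k, v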


xml_doc = """<test>
    <data description="Some Text &#xD; &#xA; &#x9;">
        some data
    </data>
</test>"""

soup = BeautifulSoup(xml_doc, "xml")

print(soup.prettify(formatter=CustomAttributes()))

Prints:

<?xml version="1.0" encoding="utf-8"?>
<test>
 <data description="Some Text &#xD; &#xA; &#x9;">
  some data
 </data>
</test>
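
The formatter has to re-escape on output because the character references no longer exist once the document is parsed: an XML parser resolves &#xD;, &#xA; and &#x9; to the actual characters while reading attribute values. A minimal sketch of this, using the standard library's xml.etree.ElementTree instead of BeautifulSoup (an assumption made only to keep the example dependency-free; the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# Character references in an attribute value are resolved at parse time,
# so the tree holds the real CR, LF and TAB characters.
escaped = ET.fromstring('<data description="a&#xD;&#xA;&#x9;b"/>')
decoded_value = escaped.get("description")
print(repr(decoded_value))  # 'a\r\n\tb'

# A literal tab in an attribute value is normalized to a space by the parser,
# which is why a serializer must write it as a character reference to round-trip.
literal = ET.fromstring('<data description="a\tb"/>')
normalized_value = literal.get("description")
print(repr(normalized_value))  # 'a b'
```

Since the parsed tree does not record how a character was originally written, restoring the references on output, as the custom formatter does, is probably the practical place to intervene.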


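Thank you, Andrej. Your patch works. But I would prefer a more general solution: rather than converting back what BeautifulSoup has converted, disable the conversion itself. With this approach I need to know the complete list of &#...; sequences and convert them back in the custom formatter. I am also a bit worried about HTMLFormatter. Judging by the name, it is designed for HTML, but I am working with XML, and I am not sure whether that makes a difference. My main point is that my script should not touch anything unrelated to the change I want to make. For example, I need to remove the label attribute from all columns of certain specific tables, but as a side effect I get formatting changes throughout the whole model, in tables that are completely outside my scope.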
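
On the question of the complete list: XML 1.0 permits only three characters below U+0020 (tab, line feed and carriage return), so for control characters the three replacements in the answer are already exhaustive. A minimal sketch of a reusable helper follows; the name escape_control_chars is hypothetical and not part of BeautifulSoup, and it could be called on each value inside a formatter's attributes method:

```python
import re

# Tab, line feed and carriage return: the only characters below U+0020
# that XML 1.0 allows in a document.
_XML_CONTROL_CHARS = re.compile(r"[\t\n\r]")


def escape_control_chars(value: str) -> str:
    """Rewrite each tab, LF and CR as a hexadecimal character reference."""
    return _XML_CONTROL_CHARS.sub(lambda m: f"&#x{ord(m.group()):X};", value)


print(escape_control_chars("Some Text \r \n \t"))
# Some Text &#xD; &#xA; &#x9;
```

This only covers control characters in attribute values; whether other serialization differences appear in a given model would still need to be checked by comparing the input and output files.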