BeautifulSoup encoding error: character maps to <undefined> (Python)

python html encoding

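I have written a script that is supposed to retrieve HTML pages from a site and update their contents. The following function looks for a certain file on my system, then tries to open and edit it: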

def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]

    except Exception:
        print('no sns were found')
        pass

    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn

    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S) 
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)

    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception.  Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work.  Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work.  Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass

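This works most of the time, but sometimes the write fails. The exception handler then kicks in and I try every encoding I can think of, without success: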

updating the sn
Met exception.  Changing encoding to cp1252
cp1252 did'nt work.  Changing encoding to utf-8
Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
    file.write(result)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 79, in <module>
    get_latest(entries[0], int(num), entries[1])
  File "scraper.py", line 56, in get_latest
    update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
    file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes

Can anyone give me some advice on how to better handle HTML data with inconsistent encoding?

Just out of curiosity, is this line of code a typo?

file.write(result('cp1252'))

It seems to be missing the .encode method, which is why the second traceback appears:

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

Consider changing the line to:

file.write(result.encode('cp1252'))

Note that result.encode returns bytes, so the file also has to be opened in binary mode for this write to succeed.

I once had a similar problem with encoding when writing to a file and worked out my own solution with the help of another thread. My problem was solved by changing the parser from html.parser to html5lib. I traced the root cause to malformed HTML tags, which the html5lib parser handled. For reference, the BeautifulSoup documentation describes each parser it supports.

Hope this helps.
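The fallback chain attempted in the question can be sketched more compactly. The following is only an illustration under stated assumptions: `write_with_fallback` is a hypothetical helper name, the encoding order is arbitrary, and the sample strings are invented. It encodes the text explicitly and writes the resulting bytes in binary mode, which avoids both the `'str' object is not callable` and the `write() argument must be str, not bytes` errors.

```python
import os
import tempfile


def write_with_fallback(path, text, encodings=("cp1252", "utf-8")):
    """Write text as bytes using the first encoding that can represent it.

    Hypothetical helper for illustration; returns the encoding that was used.
    """
    for encoding in encodings:
        try:
            data = text.encode(encoding)  # str.encode returns bytes
        except UnicodeEncodeError:
            continue  # this codec cannot represent some character; try the next
        # Binary mode ("wb") accepts bytes; text mode ("w") would raise TypeError.
        with open(path, "wb") as file:
            file.write(data)
        return encoding
    raise ValueError("none of the encodings could represent the text")


# Invented sample data written to a temporary location.
target = os.path.join(tempfile.mkdtemp(), "page.htm")
used_plain = write_with_fallback(target, "<p>caf\u00e9</p>")   # fits in cp1252
used_arrow = write_with_fallback(target, "<p>a \u2192 b</p>")  # needs utf-8
```

If the encoding can change from file to file like this, any charset declared inside the HTML itself would need to match the encoding actually used. Always writing UTF-8 is usually simpler.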

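In your code you open the file in text mode but then try to write bytes (str.encode returns bytes), so Python raises an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, open the file in binary mode.

BeautifulSoup detects the document's encoding (when the input is bytes) and automatically converts it to a string. The detected encoding is available as .original_encoding, and you can use it to encode the content when writing to the file. For example: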

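from bs4 import BeautifulSoup

# Passing bytes lets BeautifulSoup detect the encoding itself.
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
# original_encoding is None when the input was already a str, so fall back to UTF-8.
encoding = soup.original_encoding or 'utf-8'
print(encoding)
# ascii

# Binary mode, because data.encode(...) returns bytes.
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))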

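Alternatively, you can open the file in text mode and set the encoding in open (instead of encoding the content yourself), but note that this option is not available in Python 2, where the built-in open does not accept an encoding argument.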
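A minimal sketch of that text-mode alternative, using only the standard library. The sample string and the temporary file path are invented for illustration. It first reproduces the "character maps to <undefined>" failure by encoding to cp1252, which is the codec shown in the question's traceback, and then writes the same text successfully by passing `encoding="utf-8"` to `open`.

```python
import os
import tempfile

# Invented sample: U+2192 (rightwards arrow) has no cp1252 mapping,
# while U+20AC (euro sign) does.
html_text = "<p>price \u2192 10\u20ac</p>"

# Encoding to cp1252 fails the same way as in the question's traceback.
try:
    html_text.encode("cp1252")
    failed_char = None
except UnicodeEncodeError as exc:
    failed_char = exc.object[exc.start:exc.end]
    print("cp1252 cannot encode:", ascii(failed_char))

# Text mode with an explicit encoding: write() takes a str and open() encodes it.
path = os.path.join(tempfile.mkdtemp(), "out.htm")
with open(path, "w", encoding="utf-8") as file:
    file.write(html_text)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as file:
    round_trip = file.read()
```

Without the `encoding` argument, text mode uses a platform-dependent default. On the Windows setup in the question that default was evidently cp1252, which is why the plain `file.write(result)` call failed.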
