Persistent non-UTF-8 characters in a scraped HTML file

html python-3.x web-scraping


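I'm currently learning web scraping, and I'm trying to save the HTML of the page "https://www.wuxiaworld.co/Master-Hunter-K/1061716.html" using Beautiful Soup and the requests module.

Every time, stray characters appear at the start of the saved HTML file instead of the expected text.

Here is my code: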

from bs4 import BeautifulSoup
import requests
link = "https://www.wuxiaworld.co/Master-Hunter-K/1061716.html"
html = requests.get(link,timeout = 2)
soup = BeautifulSoup(html.text,'html.parser')
with open("test.html","a",encoding ="utf-8-sig") as file:
    file.write(str(soup))

Any help would be greatly appreciated. Thanks.

OK, that is an HTML BOM, which stands for Byte Order Mark.
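To make the BOM concrete, here is a minimal sketch of how it turns into stray characters. The page body below is a hypothetical stand-in, not the real response. The UTF-8 BOM is the three bytes EF BB BF; read as ISO-8859-1 they become the three characters "ï»¿", while the `utf-8-sig` codec recognises and drops them.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Hypothetical page body, used only for illustration.
raw = bom + "<html>hello</html>".encode("utf-8")

# Decoding as ISO-8859-1 turns each BOM byte into its own character.
wrong = raw.decode("iso-8859-1")

# Decoding as UTF-8-SIG recognises the BOM and removes it.
right = raw.decode("utf-8-sig")

# Show the code points of the three stray leading characters.
print([hex(ord(ch)) for ch in wrong[:3]])
print(right)
```

Plain `utf-8` decoding would not produce three junk characters, but it would keep the BOM as a single invisible U+FEFF character at the start of the string.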

Let's see what is actually happening:

import requests

r = requests.get(
    'https://www.wuxiaworld.co/Master-Hunter-K/1061716.html')

print(r.headers['Content-Type'])

text/html

Let's check the encoding:

print(r.encoding)

ISO-8859-1

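This is the default for HTML4 (requests falls back to it here because the Content-Type header declares no charset), but the default for HTML5 is UTF-8.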
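The fallback can be sketched with the standard library alone. This is a simplified illustration of the rule, assuming it matches what requests does for a `text/*` response without a declared charset; `fallback_encoding` is a hypothetical helper and not part of requests.

```python
from email.message import Message


def fallback_encoding(content_type):
    """Hypothetical helper: pick a charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type

    # Use the charset parameter when the header declares one.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset

    # No charset declared: text/* content falls back to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Anything else is left undecided in this sketch.
    return None


print(fallback_encoding("text/html"))
print(fallback_encoding("text/html; charset=utf-8"))
```

With the header from this page, `text/html`, the sketch returns ISO-8859-1, which is why the UTF-8 bytes of the page get decoded with the wrong codec.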

So now we need to tell requests to use the encoding it detects from the content itself.

To do that, we use:

r.encoding = r.apparent_encoding
print(r.encoding)

UTF-8-SIG
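For intuition about how a BOM can reveal the encoding, here is a stdlib-only sketch. It is not how `apparent_encoding` works (requests delegates that to a character-detection library); `sniff_bom_encoding` is a hypothetical helper that only looks at the leading bytes.

```python
import codecs


def sniff_bom_encoding(raw, default="utf-8"):
    """Hypothetical helper: name a codec based on a leading BOM, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # The UTF-32 LE BOM begins with the UTF-16 LE BOM, so test UTF-32 first.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to the caller's default.
    return default


# Hypothetical response body that starts with a UTF-8 BOM.
body = codecs.BOM_UTF8 + b"<html></html>"
encoding = sniff_bom_encoding(body)
text = body.decode(encoding)
print(encoding, text)
```

In a real script the bytes would come from `r.content`, and the returned name could be assigned to `r.encoding` in the same way as `r.apparent_encoding`.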

The final code is as follows:

import requests


r = requests.get(
    'https://www.wuxiaworld.co/Master-Hunter-K/1061716.html')
r.encoding = r.apparent_encoding
with open('page.html', 'w', encoding='UTF-8-SIG') as pop:
    pop.write(r.text)
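The effect of `encoding='UTF-8-SIG'` on the saved file can be checked without a network call. The sketch below uses a hypothetical string in place of `r.text` and a temporary directory in place of the working directory.

```python
import codecs
import os
import tempfile

# Stand-in for r.text after it has been decoded correctly.
text = "<html>hello</html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")

    # Writing with utf-8-sig puts one BOM at the start of the file.
    with open(path, "w", encoding="utf-8-sig") as fh:
        fh.write(text)

    # Inspect the raw bytes that landed on disk.
    with open(path, "rb") as fh:
        raw = fh.read()

    # Reading back with utf-8-sig strips the BOM again.
    with open(path, encoding="utf-8-sig") as fh:
        round_trip = fh.read()

print(raw.startswith(codecs.BOM_UTF8))
print(round_trip == text)
```

If the BOM is not wanted in the saved file at all, opening it with `encoding='utf-8'` writes the same text without the three leading bytes.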

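Please add an explanation to the code so that readers can understand how it solves the problem.

@ArunVinoth You have it now, so you can follow it.

Thank you, Mr. Ahmed. I had been looking for a solution for the past two nights, and you helped us find the right one.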