Encoding problem when parsing and scraping HTML with BeautifulSoup

html python-3.x encoding web-scraping

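Using Python 3, BeautifulSoup, and minimal regular expressions, I am trying to scrape the text from this web page:

I have already successfully extracted its HTML into a file. In fact, I have done this for almost all of the presidential speeches on this site; I have the HTML for 247 (of a possible 258) speeches saved on my computer.

My code for extracting just the text of each page is as follows: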

import re
from bs4 import BeautifulSoup

with open('scan_here.txt') as reference:       #'scan_here.txt' is a file containing all the pages whose html I have downloaded successfully
    for line in reference:
        line_unclean = reference.readline() #each file's name is just a random string of 5-6 integers
        line = str(re.sub(r'\n', '', line_unclean)) #for removing '\n' from each file name
        f = open(('local_path_to_folder_containing_all_the_html_files\\') + line)
        doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)

With this code, I get the following error message:

Traceback (most recent call last):
  File "extract_individual_speeches.py", line 11, in <module>
    doc = f.read()
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte

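I know this is a decoding error, and I tried running this code on other HTML files, not just the first one that appears in the 'scan_here.txt' list of file names. Same error, so I think it is an encoding problem local to the HTML files.

I think the problem may lie in the third line of the HTML, which declares the same encoding in all of my HTML files: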

<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

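What is windows-1251? I assume that is the problem. I looked it up and found some windows-1251-to-UTF-8 converters, but I didn't see one that works well with Python.

I found this, but I'm not sure how to integrate it with my existing code.

Any help you can offer on this issue is greatly appreciated, TIA.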

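'windows-1251' is a standard Windows encoding. Python is trying to read the file as UTF-8, so the text has to be decoded from windows-1251 instead. You can specify the encoding when you open the file.

Try something like this: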
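As a small illustration of the error in the traceback, the sketch below decodes a short byte string containing 0x97 both ways. The sample bytes are made up for the demonstration; only the 0x97 byte is taken from the error message. In windows-1251 that byte maps to an em dash (U+2014), while in UTF-8 it can only appear as a continuation byte, never as the first byte of a character.

```python
# Hypothetical sample: a few ASCII words around the 0x97 byte from the traceback.
raw = b'speech \x97 text'

# Decoding as UTF-8 fails, just like f.read() did in the question.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as windows-1251 succeeds and turns 0x97 into U+2014.
decoded = raw.decode('windows-1251')
```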

with open(file,'r',encoding='windows-1251') as f:
  text = f.read()
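One way this could be folded into the question's loop is sketched below. The helper names (`iter_filenames`, `read_html`, `load_pages`) are invented for this example and are not part of the original code. The sketch also iterates over the list file directly: the question's loop calls `reference.readline()` inside `for line in reference`, which appears to consume an extra line on every pass, so every other file name would be skipped.

```python
import os

def iter_filenames(list_path):
    """Yield each non-empty file name from the list file, one per line."""
    with open(list_path) as reference:
        for line in reference:
            name = line.strip()  # drops the trailing '\n' without needing re.sub
            if name:
                yield name

def read_html(path, encoding='windows-1251'):
    """Read one saved page, decoding it with the charset its <META> tag declares."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()

def load_pages(list_path, folder):
    """Return {file name: decoded HTML} for every file named in list_path."""
    return {name: read_html(os.path.join(folder, name))
            for name in iter_filenames(list_path)}
```

Each decoded string can then be passed to `BeautifulSoup(doc, 'html.parser')` and queried with `soup.select('span.display-text')` exactly as in the question.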

Or:

You can also use codecs:

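import codecs
# Read the file as windows-1251, then write it back out as UTF-8
with codecs.open(file, 'r', 'windows-1251') as src:
    text = src.read()
with codecs.open(file, 'w', 'UTF-8') as dst:
    dst.write(text)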
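If the goal is to convert the saved pages once and keep the originals, the same idea can be written with `pathlib`. This is only a sketch: `convert_to_utf8` and its parameters are hypothetical names, and writing to a separate destination path is an assumption made here to avoid overwriting the source file.

```python
from pathlib import Path

def convert_to_utf8(src, dst, src_encoding='windows-1251'):
    """Decode src with src_encoding and write the same text to dst as UTF-8."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding='utf-8')
    return dst
```

Note that the `<META>` tag inside the converted file would still say `charset=windows-1251`, since this sketch does not rewrite the markup.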

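Hi Peter, thanks for your answer. I don't even know whether this has to do with the HTML files' encoding, because when I manually save one as UTF-8 I get the same error message.

Please remove the BOM from the file, then try text = text.encode(encoding='UTF-8', errors='replace')

Oh, I see, it specifies that the byte sequence is formatted differently. I did not save it with a UTF-8 BOM, just plain UTF-8.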
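For reference, here is a hedged sketch of what the BOM and `errors='replace'` suggestions amount to on the decoding side, which is where the traceback occurs. The sample bytes are invented for the demonstration. The 'utf-8-sig' codec drops a leading UTF-8 BOM if one is present, and `errors='replace'` substitutes U+FFFD for bytes that cannot be decoded, so it hides the error rather than recovering the original character.

```python
import codecs

# Hypothetical sample: a UTF-8 BOM, some ASCII text, then a stray 0x97 byte.
raw = codecs.BOM_UTF8 + 'speech'.encode('utf-8') + b' \x97'

# Plain UTF-8 keeps the BOM as U+FEFF and replaces the bad byte with U+FFFD.
lossy = raw.decode('utf-8', errors='replace')

# 'utf-8-sig' strips the BOM; the bad byte is still replaced, not recovered.
no_bom = raw.decode('utf-8-sig', errors='replace')
```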