Encoding issue with BeautifulSoup parsing and scraping

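Using Python 3, BeautifulSoup, and minimal regular expressions, I am trying to scrape the text from this web page:

I have already extracted its HTML into a file successfully. In fact, I have done this for almost all of the presidential speeches on this site; I have the HTML of 247 speeches (out of a possible 258) saved on my computer.

My code for extracting just the text from each page looks like this: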

import re
from bs4 import BeautifulSoup

with open('scan_here.txt') as reference:       #'scan_here.txt' is a file containing all the pages whose html I have downloaded successfully
    for line in reference:
        line_unclean = reference.readline() #each file's name is just a random string of 5-6 integers
        line = str(re.sub(r'\n', '', line_unclean)) #for removing '\n' from each file name
        f = open(('local_path_to_folder_containing_all_the_html_files\\') + line)
        doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)
With this code, I get the following error message:

Traceback (most recent call last):
  File "extract_individual_speeches.py", line 11, in <module>
    doc = f.read()
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte
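I understand that this is a decoding error, and I tried running the code on other HTML files, not just the first one listed in 'scan_here.txt'. I got the same error, so I think this is an encoding problem local to the HTML files.

I think the problem may lie in the third line of the HTML, which declares the same encoding in all of my HTML files: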

<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

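What is windows-1251? I assume that is the problem. I looked it up and found that some windows-1251 to UTF-8 converters exist, but I did not see one that works well with Python.

I found this, but I'm not sure how to integrate it with my existing code.

Any help with this problem would be greatly appreciated, TIA.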

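'windows-1251' is a standard Windows encoding (the Cyrillic code page), whereas Python is trying to decode your files as UTF-8. You can specify the encoding when opening the file.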
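To illustrate why the default decoder rejects these pages, here is a minimal sketch. The sample bytes are made up for illustration; only the byte 0x97 comes from the traceback in the question. In windows-1251 that byte is an em dash, while in UTF-8 it can only appear as a continuation byte, never at the start of a character.

```python
# Hypothetical sample: a short byte string containing 0x97, the byte
# named in the traceback (not taken from the actual speech files).
raw = b"peace \x97 and freedom"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason  # short description of why decoding failed

# The charset declared in the pages' <META> tag decodes the same bytes.
text = raw.decode("windows-1251")

print(utf8_error)
print(ascii(text))  # ascii() keeps the output safe on any console encoding
```

The same bytes are valid or invalid depending only on which codec is asked to interpret them, which is why passing the right `encoding` to `open()` is enough.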

Try this:

# 'file' stands for the path to one of the downloaded HTML files
with open(file, 'r', encoding='windows-1251') as f:
    text = f.read()
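As a rough sketch of how the `encoding` argument might fit into the loop from the question: the helper name `iter_html_docs` and its parameters are hypothetical, and it assumes every listed file really is windows-1251, as the `<META>` tag claims. It also drops the extra `reference.readline()` call, because iterating with `for line in reference` already yields each line, so the extra call would consume every other file name.

```python
import os


def iter_html_docs(list_path, html_dir, encoding="windows-1251"):
    """Yield (file_name, decoded_html) for each name listed in list_path.

    Hypothetical helper: list_path plays the role of 'scan_here.txt' and
    html_dir is the folder holding the downloaded HTML files.
    """
    with open(list_path) as reference:
        for line in reference:
            name = line.strip()  # drop the trailing newline
            if not name:
                continue  # skip blank lines in the list
            path = os.path.join(html_dir, name)
            # Decode with the charset declared in the pages' <META> tag
            with open(path, encoding=encoding) as f:
                yield name, f.read()
```

Each yielded string can then be passed to `BeautifulSoup(doc, 'html.parser')` and queried with `soup.select('span.display-text')`, exactly as in the question.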
Alternatively, you can use the codecs module:

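import codecs
# Re-encode the file in place: read as windows-1251, write back as UTF-8
with codecs.open(file, 'r', 'windows-1251') as src:
    text = src.read()
with codecs.open(file, 'w', 'UTF-8') as dst:
    dst.write(text)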

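Hi Peter, thanks for your answer. I'm not even sure this has to do with the encoding of the HTML files, because when I manually save one as UTF-8, I get the same error message.

Please remove the BOM from the file, then try text = text.encode(encoding='UTF-8', errors='replace')

Oh, I see: the byte sequence is formatted differently. I didn't save it with a UTF-8 BOM, just plain UTF-8.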
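One way to check what the comments are debating is to look at the raw bytes directly. The sketch below uses a hypothetical helper, `inspect_encoding`, which is not part of the original thread. It reports whether the data starts with a UTF-8 BOM and whether it decodes as UTF-8 at all.

```python
import codecs


def inspect_encoding(raw: bytes) -> dict:
    """Report whether raw bytes start with a UTF-8 BOM and decode as UTF-8."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")
        valid_utf8 = True
        bad_offset = None
    except UnicodeDecodeError as exc:
        valid_utf8 = False
        bad_offset = exc.start  # index of the first undecodable byte
    return {"has_bom": has_bom, "valid_utf8": valid_utf8, "bad_offset": bad_offset}
```

To use it, open one of the saved pages in binary mode (`open(path, 'rb')`) and pass the result of `read()` to the helper. If `valid_utf8` is false, the file was not actually re-saved as UTF-8, whatever the editor reported.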