
Python 3: scraping a web page with BeautifulSoup causes a UnicodeDecodeError


I'm working on a private project in which I'm trying to parse a web page. Unfortunately, I didn't realize that scraping the site would get my access suspended. I made a local copy of the site through hide.me, but apparently that added some extra markup that makes the page hard for BeautifulSoup to read. Here is my code:

import os
from bs4 import BeautifulSoup

def pull_safe(location):
    url = os.getcwd() + '/HTML_SOURCES/' + location
    page = open(url, encoding="ascii")
    soup = BeautifulSoup(page, "html.parser", exclude_encodings=["ascii"])
    hospital = list()
    templist = list()
    tempcount = 0
    # Group every six <td> cells under the report div into one row.
    for td in soup.find('div', {'class': 'report'}).parent.find_all('td'):
        if tempcount != 5:
            templist.append(td.text)
            tempcount += 1
        else:
            templist.append(td.text)
            hospital.append(templist)
            templist = list()
            tempcount = 0
    return hospital
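As an aside, the row-grouping loop above can be written more compactly with a slice-based chunker. This is a minimal sketch; `chunk_rows` and the row width of 6 are illustrative, not part of the original code:

```python
def chunk_rows(cells, width=6):
    """Split a flat list of cell texts into rows of `width` items."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]

# Hypothetical usage with the tds from the question:
#   hospital = chunk_rows([td.text for td in tds])
print(chunk_rows(list(range(12))))
```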
This is the exception I get:

Traceback (most recent call last):
  File "/home/memeputer/Documents/Projects/NYC Hospital Bed count/main.py", line 51, in <module>
    g = pull_safe(item)
  File "/home/memeputer/Documents/Projects/NYC Hospital Bed count/main.py", line 17, in pull_safe
    soup = BeautifulSoup(page, "html.parser", exclude_encodings=["utf-8"])
  File "/home/memeputer/Documents/Projects/NYC Hospital Bed count/venv/lib/python3.8/site-packages/bs4/__init__.py", line 286, in __init__
    markup = markup.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 59903: invalid start byte
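The failing byte can be reproduced in isolation. A minimal sketch (not part of the original question) showing that 0xA9, the copyright sign in Latin-1/Windows-1252, is not a valid UTF-8 start byte:

```python
# 0xA9 is © in Latin-1/Windows-1252, but on its own it is not a valid
# UTF-8 start byte, which is exactly the error in the traceback above.
raw = b"\xa9"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... invalid start byte

print(raw.decode("latin-1"))  # the copyright sign
```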

Any help is appreciated.

It's telling you that the byte at position 59903 is not valid utf-8. Can you show us the URL?

It's the copyright symbol ©. I really don't like this; my workaround was to save the page locally and then delete the character. Is there a better way to handle it?

If the original page is encoded as utf-8, then you need to encode and decode everything as utf-8. Change this:

page = open(url, encoding="ascii")

to:

page = open(url, encoding="utf-8")
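If the saved copy mixes encodings, two more defensive options are worth knowing. This is a sketch (the temp file and its contents are illustrative, not from the original answer): open the file in binary mode and let BeautifulSoup's own detection (bs4's UnicodeDammit) pick the encoding, or decode with errors="replace" so undecodable bytes become U+FFFD instead of raising:

```python
import os
import tempfile

# Write a file containing a Latin-1 copyright sign to reproduce the problem.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
    f.write(b"Copyright \xa9 2020")
    path = f.name

# Option 1 (sketch): pass raw bytes and let BeautifulSoup detect the encoding:
#     soup = BeautifulSoup(open(path, "rb"), "html.parser")

# Option 2: decode as utf-8 but substitute undecodable bytes instead of crashing.
with open(path, encoding="utf-8", errors="replace") as page:
    text = page.read()
print(text)  # Copyright � 2020

os.remove(path)
```

Option 2 keeps the parse going at the cost of losing the original character; option 1 is usually the better default, since BeautifulSoup handles byte input natively.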