Python: encoding error - web page content

python unicode character-encoding

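I'm trying to fetch the content of a web page, parse it, and then save it in a MySQL database.

I already have this working for a web page encoded in UTF-8.

But when I try it with a web page encoded in 8859-9, I get an error.

My code for getting the page content: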

import urllib2
import chardet

def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would replace the first
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    response = opener.open(url).read()
    #print chardet.detect(response).get('encoding')
    opener.close()
    return response



url     = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
contentofpage = getcontent(url)
print contentofpage
print chardet.detect(contentofpage)
# The next line is the one that raises the UnicodeDecodeError
print contentofpage.encode("utf-8")

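Output of the page content: ... E�itim Teknolojileri Genel M�D�rl��

{'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}
Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)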

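Actually, this page is in Turkish and its encoding is ISO-8859-9.

When I try the default encoding, I see �� in place of some characters. How can I convert the page content to UTF-8 or to Turkish (ISO-8859-9)?

When I use unicode(contentofpage), I get:

Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Any help?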

I think you want to decode, not encode, since the content is already encoded.

print contentofpage.decode("iso-8859-9")

This produces output like:

Eğitim Teknolojileri Genel Müdürlüğü
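As background to the point above: in Python 2, calling .encode("utf-8") on a byte string first decodes it implicitly with the ASCII codec, which is why the traceback mentions 'ascii' even though UTF-8 was requested. Below is a minimal sketch of the decode-then-encode round trip. It assumes Python 3 syntax (the question uses Python 2). The names SAMPLE and to_utf8 are made up for illustration, and the bytes are built locally from the sample phrase rather than fetched from the page.

```python
# Sketch in Python 3 syntax, where bytes and text are separate types.
# SAMPLE is built locally from the phrase in the answer; nothing is downloaded.
SAMPLE = "Eğitim Teknolojileri Genel Müdürlüğü"
raw = SAMPLE.encode("iso-8859-9")   # stands in for the bytes read from the page

def to_utf8(raw_bytes, source_encoding="iso-8859-9"):
    """Decode bytes from source_encoding into text, then re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")

text = raw.decode("iso-8859-9")     # bytes -> text
utf8_bytes = to_utf8(raw)           # bytes -> text -> UTF-8 bytes

# chardet guessed ISO-8859-2; decoding with that guess does not fail, but it
# maps some bytes to different letters, so the result no longer matches.
guessed = raw.decode("iso-8859-2")

print(text == SAMPLE, guessed == SAMPLE)
```

The source encoding has to come from what you know about the page (here ISO-8859-9). A detector's guess can be close but still wrong for a few letters.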

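print contentofpage.decode("iso-8859-9") gives UnicodeEncodeError: 'ascii' codec can't encode character u'\xee' in position 458: ordinal not in range(128)

Please make sure you decode directly after fetching the content:

contentofpage = getcontent(url)

and then

print contentofpage.decode('iso-8859-9')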
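A UnicodeEncodeError at this point usually means the decode itself worked and the failure happens on output: the decoded text has to be encoded again for the console, and an ASCII console cannot represent the Turkish letters. One possible workaround is to encode explicitly when writing. This is a sketch in Python 3 syntax; write_text is a hypothetical helper, and the in-memory streams stand in for a real console or file.

```python
import io

def write_text(stream, text, encoding="utf-8"):
    """Encode text explicitly before writing it to a binary stream."""
    # errors="replace" substitutes "?" for characters the encoding lacks
    stream.write(text.encode(encoding, errors="replace"))

# Decode right after fetching; these bytes are a hand-made stand-in for the page
page_text = b"E\xf0itim".decode("iso-8859-9")

utf8_out = io.BytesIO()
write_text(utf8_out, page_text)             # UTF-8 can represent every character

ascii_out = io.BytesIO()
write_text(ascii_out, page_text, "ascii")   # unsupported characters become "?"

print(repr(utf8_out.getvalue()), repr(ascii_out.getvalue()))
```

Writing UTF-8 bytes keeps the text intact. The ASCII variant only avoids the exception and loses the original letters.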
