SQLite, Python, Unicode, and non-UTF data
I started by trying to store strings in sqlite using python, and got the message:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

OK, I switched to Unicode strings. Then I started getting the message:

sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

when trying to retrieve data from the database. I did more research and started encoding the text as utf8, but then 'Sigur Rós' started looking like 'Sigur RÃ³s'.

Note: as @John Machin pointed out, my console was set to display in latin_1.

What gives? After reading a description of exactly the same situation I'm in, it seems the advice is to ignore the other advice and use 8-bit bytestrings after all.

I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple of hours, but I still don't know whether there is a way to correctly convert 'ó' from latin-1 to utf-8 without mangling it. If there isn't, why would sqlite "highly recommend" that I switch my application to unicode strings?
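I'm going to update this question with a summary and some example code covering everything I've learned in the last 24 hours, so that someone in my shoes can have an easy(er) guide. If anything I post is wrong or misleading, please tell me and I'll update it, or one of you senior folks can update it.

Summary of answers

Let me first state the goal as I understand it. If you are trying to convert between encodings, the point is to know what your source encoding is, use that source encoding to convert the data to unicode, and then convert it to your desired encoding. Unicode is the base, and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because those characters are not stored in the same place as in, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python, the process of getting to unicode and into another encoding looks like: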
str.decode('source_encoding').encode('desired_encoding')
or, if the str is already unicode:
str.encode('desired_encoding')
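As a minimal sketch of that decode-then-encode round trip, here is the same idea written for Python 3, where `bytes` and `str` take the place of Python 2's `str` and `unicode` (an assumption on my part; the thread itself is about Python 2.x). The variable names are just for illustration.

```python
# Assumes Python 3: bytes are encoded data, str is Unicode text.
latin1_bytes = b"Sigur R\xf3s"             # 'ó' is the single byte 0xF3 in Latin-1

text = latin1_bytes.decode("latin_1")      # source encoding -> Unicode text
utf8_bytes = text.encode("utf_8")          # Unicode text -> desired encoding

print(repr(text))
print(repr(utf8_bytes))
```

The important part is that the source encoding is named explicitly; nothing is guessed.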
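For sqlite, I didn't actually want to encode the data again; I wanted to decode it and leave it as unicode. Here are four things you might need to be aware of when working with unicode and encodings in python.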
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')
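This system encoding is the one that gets used when you call the unicode("str") function without any other encoding parameter. In other words, python tries to decode "str" to unicode based on the default system encoding.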
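To make the difference between an implicit and an explicit decode concrete, here is a small sketch. It assumes Python 3, where `bytes.decode()` with no argument uses UTF-8 (in Python 2 the implicit codec was the system default discussed above, normally ASCII). The variable names are mine.

```python
import sys

raw = b"\xf3"  # 'ó' encoded as Latin-1

# Interpreter-wide default, shown only for inspection.
print("default encoding:", sys.getdefaultencoding())

try:
    implicit = raw.decode()           # no codec named: Python 3 assumes UTF-8
except UnicodeDecodeError as err:
    implicit = None
    print("implicit decode failed:", err)

explicit = raw.decode("latin_1")      # naming the real source encoding works
print(repr(explicit))
```

Naming the codec every time means the result no longer depends on how the interpreter happens to be configured.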
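(3) If you're using IDLE or the command-line python, I think your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm that what I was doing was working.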
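A console that assumes the wrong encoding is effectively decoding your bytes with the wrong codec. The sketch below (Python 3 assumed, variable names mine) reproduces in code what a latin_1 console shows when it is handed utf_8 bytes.

```python
# Assumes Python 3. "\u00f3" is 'ó', written as an escape so the example
# does not depend on the encoding of this source file.
text = "Sigur R\u00f3s"

utf8_bytes = text.encode("utf_8")                   # what the program emits
seen_on_latin1_console = utf8_bytes.decode("latin_1")  # how the console reads it

print(seen_on_latin1_console)
```

The two bytes that utf_8 uses for 'ó' (0xC3 0xB3) are each shown as a separate latin_1 character, which is where the "Ã³" comes from.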
(4) If you want to have some test strings, e.g.
test_str = "ó"
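in your source code, then you have to tell python which encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily done by putting a line like this at the top of your source code file: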
# -*- coding: utf_8 -*-
If you don't provide this declaration, python attempts to parse your code as ascii by default, and so:
SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
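One way to keep test strings independent of the file's encoding is to write the character as an escape. This is only a sketch and assumes Python 3 with the file saved as UTF-8; the names are illustrative.

```python
# -*- coding: utf-8 -*-
# Assumes Python 3 and a source file saved as UTF-8.
literal = "ó"          # correct only if the file really is saved as declared
escaped = "\u00f3"     # pure ASCII in the file, same character at runtime

latin1_bytes = escaped.encode("latin_1")   # one byte: 0xF3
utf8_bytes = escaped.encode("utf_8")       # two bytes: 0xC3 0xB3

print(literal == escaped, repr(latin1_bytes), repr(utf8_bytes))
```

The escaped form survives any editor or encoding mix-up, which makes it handy for test data.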
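Once the program is working, or if you aren't using python's console or any other console to look at output, you will probably only care about #1 on the list. The system default and console encodings are not that important unless you need to look at output and/or you are using the built-in unicode() function (without any encoding parameter) instead of the string.decode() function. I wrote a demo function, pasted at the bottom of this gigantic mess, that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how the various methods react to that character as input. My system encoding and console output are both set to utf_8 for this run: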
'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
With the system encoding and the console both switched to latin_1, the original character displays correctly and the built-in unicode() function now works:
'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
With the console switched back to utf_8, the calls succeed as before but the console can no longer display the result:
'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Many thanks to the answers below, and especially to @John Machin for answering so thoroughly.
#!/usr/bin/env python
# -*- coding: utf_8 -*-
# Python 2 demo: shows how one byte string behaves under the default system
# encoding, latin_1 and utf_8, first as a whole and then character by character.
import sys

def encodingDemo(str):
    # Collects every form of the string that could be produced without an error.
    validStrings = ()
    # 1. Print the raw byte string as-is.
    try:
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print ude
    # 2. Decode with the default system encoding (what unicode() does implicitly).
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    # 3. Decode explicitly as latin_1, then re-encode as utf_8.
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeDecodeError as ude:
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    # 4. Decode explicitly as utf_8, then re-encode as latin_1.
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
            # Only record this form when the re-encoding actually succeeded.
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
        except UnicodeDecodeError as ude:
            print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee
    print
    # 5. Repeat the same checks for each byte of the original string.
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee
        try:
            x = unicode(char)
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        print

x = 'ó'
encodingDemo(x)
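For readers on Python 3, the core of that demo can be condensed into a few lines. This is a rough sketch rather than a port of the function above: `try_decodings` is a hypothetical helper name, and it only covers the decoding step, not the console-printing failures.

```python
def try_decodings(raw, encodings=("ascii", "latin_1", "utf_8")):
    """Decode a bytes value with each codec; map codec name to text or error."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError as err:
            results[name] = "ERROR: {0}".format(err)
    return results

report = try_decodings(b"\xf3")   # the Latin-1 byte for 'ó'
for name, outcome in sorted(report.items()):
    print("{0:8} -> {1!r}".format(name, outcome))
```

Only latin_1 decodes the byte; ascii and utf_8 both report an error, which matches the per-character output shown earlier.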
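Of course there is a way. Text that was decoded with the wrong codec can be repaired by reversing the mistake:
>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós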
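The same repair can be wrapped in a small helper. This is a sketch assuming Python 3; `repair_mojibake` is a name I made up, and it only handles the specific case of utf_8 bytes that were wrongly decoded as latin_1.

```python
def repair_mojibake(text):
    """Undo 'UTF-8 bytes decoded as Latin-1'; return the input unchanged otherwise."""
    try:
        return text.encode("latin_1").decode("utf_8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 representable, or the bytes are not valid UTF-8:
        # the text was probably not damaged in this particular way.
        return text

broken = "Sigur R\u00c3\u00b3s"     # what 'Sigur Rós' looks like after the bad decode
print(repair_mojibake(broken))
```

Text that is already correct fails the round trip and is passed through untouched, so the helper is safe to run over mixed data.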
>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>
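The same inspection can be done for a whole string at once. Below is a sketch for Python 3; `describe` is a hypothetical helper built on `unicodedata.name`, with a fallback label for characters that have no name.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, "U+{0:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# One correctly decoded character versus the two-character mojibake form.
for row in describe("\u00f3") + describe("\u00c3\u00b3"):
    print(row)
```

Seeing two unrelated names where one accented letter was expected is usually the quickest way to spot a wrong decode.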
>>> unicode("\xF3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>
db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
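Here is a self-contained sketch of that in-database repair, assuming Python 3's sqlite3 module (where a BLOB argument reaches the function as `bytes`). The in-memory database and the `tags` / `tag_artist` names are hypothetical stand-ins, and the broken row is simulated by storing latin_1 bytes in a TEXT column.

```python
import sqlite3

# Hypothetical in-memory table; the names are stand-ins, not a real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_artist TEXT)")

# Simulate the damage: Latin-1 bytes stored in a TEXT column as if they were UTF-8.
conn.execute(
    "INSERT INTO tags (tag_artist) VALUES (CAST(? AS TEXT))",
    (b"Sigur R\xf3s",),
)

# CAST(... AS BLOB) hands the raw bytes to the function; the returned str
# is written back by sqlite3 as valid UTF-8 text.
conn.create_function("FIXENCODING", 1, lambda raw: raw.decode("latin_1"))
conn.execute("UPDATE tags SET tag_artist = FIXENCODING(CAST(tag_artist AS BLOB))")

fixed = conn.execute("SELECT tag_artist FROM tags").fetchone()[0]
print(repr(fixed))
```

This only makes sense when every affected row really holds latin_1 bytes; rows that are already valid utf_8 would be damaged by the same UPDATE.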
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
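In Python 3 the text_factory callable receives `bytes`, so the equivalent looks slightly different. The sketch below uses a hypothetical in-memory table with one deliberately invalid value to show what 'ignore' actually does to the data.

```python
import sqlite3

# Hypothetical in-memory table holding one value that is not valid UTF-8.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_artist TEXT)")
conn.execute(
    "INSERT INTO tags (tag_artist) VALUES (CAST(? AS TEXT))",
    (b"Sigur R\xf3s",),
)

# text_factory is called with the raw bytes of every TEXT value that is read.
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")

lossy = conn.execute("SELECT tag_artist FROM tags").fetchone()[0]
print(repr(lossy))
```

The read no longer raises, but the undecodable byte is silently dropped, so this hides the problem rather than repairing the stored data.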
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')