SQLite, Python, Unicode, and non-UTF data


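I started off by trying to store strings in SQLite using Python, and got the message:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

OK, I switched to Unicode strings. Then I started getting the message:

sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

when trying to retrieve data from the database. After more research I started encoding the strings in utf8, but then 'Sigur Rós' started looking like 'Sigur RÃ³s'.

Note: as @John Machin pointed out, my console was set to display in 'latin_1'.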
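
The mangled display described above can be reproduced without a database. The following is a minimal sketch, written for Python 3 (where `bytes` plays the role of Python 2's `str`, and `str` the role of `unicode`); the variable names are illustrative only.

```python
# Sketch only: how UTF-8 bytes look when something renders them as Latin-1.
original = u"Sigur R\xf3s"               # u"\xf3" is U+00F3, the letter o with acute

utf8_bytes = original.encode("utf_8")     # the accented letter becomes two bytes: 0xC3 0xB3
seen_as_latin1 = utf8_bytes.decode("latin_1")   # what a Latin-1 console would show

# Undoing the wrong decode recovers the original text, so nothing was lost.
recovered = seen_as_latin1.encode("latin_1").decode("utf_8")

print(repr(utf8_bytes))
print(recovered == original)
```

The two characters in `seen_as_latin1` that replace the accented letter are U+00C3 and U+00B3, which is exactly the 'RÃ³s' effect: the data is valid UTF-8, only the display is wrong.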

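What gives? After reading a post that describes exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

Before I started this process, I didn't know much about unicode and utf. I've learned quite a bit in the last couple of hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite "highly recommend" that I switch my application to unicode strings?

I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours, so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way, please tell me and I'll update it, or one of you senior guys can update it.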


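Summary of answers

Let me first state the goal as I understand it. If you are trying to convert between various encodings, the goal is to understand what your source encoding is, convert the data to unicode using that source encoding, and then convert it to your desired encoding. Unicode is a base, and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because the characters are not in the same place as in, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python, the process of getting to unicode and into another encoding looks like this: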

str.decode('source_encoding').encode('desired_encoding')
or, if str is already unicode:

str.encode('desired_encoding')
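
As a concrete illustration of the decode-then-encode step, here is a minimal sketch written for Python 3, where the byte string is a `bytes` object and the intermediate value is a `str`. The helper name `transcode` is made up for this example.

```python
def transcode(data, source_encoding, desired_encoding):
    """Decode bytes using their real source encoding, then re-encode them."""
    text = data.decode(source_encoding)      # bytes -> Unicode text
    return text.encode(desired_encoding)     # Unicode text -> bytes

latin1_bytes = b"Sigur R\xf3s"               # the accented letter as one Latin-1 byte
utf8_bytes = transcode(latin1_bytes, "latin_1", "utf_8")

print(repr(utf8_bytes))
```

The conversion only works if `source_encoding` really is the encoding the bytes were written in; guessing wrong either raises an error or silently produces mangled text.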
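For sqlite I didn't actually want to encode it again; I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you work with unicode and encodings in python:

  • The encoding of the string you want to work with, and the encoding you want to get it to
  • The system encoding
  • The console encoding
  • The encoding of the source file

Elaboration: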

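    (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, to have latin_1, utf_8, and whatever other characters mixed together in a filename on Windows. Sometimes those characters show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.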
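
One way to handle names that are either utf_8 or latin_1 is to try utf_8 first and fall back to latin_1, then hand the resulting Unicode text to sqlite. The sketch below is written for Python 3 and makes several assumptions: the helper `decode_name` and the table name `tags` are hypothetical, the column name `tag_artist` is borrowed from the error message above, and the byte strings stand in for raw filenames.

```python
import sqlite3

def decode_name(raw, encodings=("utf_8", "latin_1")):
    """Return Unicode text for raw bytes, trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reached while latin_1 is in the list, because latin_1 maps every byte.
    raise ValueError("none of the encodings matched")

raw_names = [b"Sigur R\xf3s", b"Sigur R\xc3\xb3s", b"plain ascii"]
decoded = [decode_name(raw) for raw in raw_names]

# Store Unicode text directly; the sqlite3 module writes it out as UTF-8.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_artist TEXT)")
conn.executemany("INSERT INTO tags VALUES (?)", [(name,) for name in decoded])
rows = [row[0] for row in conn.execute("SELECT tag_artist FROM tags ORDER BY rowid")]
conn.close()

print(rows[0] == rows[1])
```

utf_8 has to come first because latin_1 never fails. The approach is a heuristic: latin_1 text whose bytes happen to form valid utf_8 would be misread, although that is uncommon in practice.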

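    (2) Python has a default system encoding that is set when python starts and cannot normally be changed at runtime (site.py removes sys.setdefaultencoding once startup is finished). The dirty summary is this file that I added:

    # sitecustomize.py
    # this file can be anywhere in your Python path,
    # but it usually goes in ${pythondir}/lib/site-packages/
    import sys
    sys.setdefaultencoding('utf_8')

    This system encoding is the one that gets used when you call the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.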

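    (3) If you're using IDLE or the command-line python, I think your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm that what I was doing was working.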
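
Both settings from items (2) and (3) can be inspected from code. This is a small sketch for Python 3, where the default encoding is fixed at utf-8 and byte strings are never decoded implicitly, so only the console encoding still varies. The helper `can_display` is a made-up name.

```python
import sys

default_encoding = sys.getdefaultencoding()
# sys.stdout.encoding can be missing or None when output is redirected.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

def can_display(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print("default:", default_encoding)
print("console:", console_encoding)
print("console can show o-acute:", can_display(u"\xf3", console_encoding))
```

A check like `can_display` only says whether the character can be encoded for the console; whether the terminal then renders those bytes correctly still depends on the terminal's own settings.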

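    (4) If you want to have some test strings, e.g.

    test_str = "ó"

    in your source code, then you have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily done by putting a line like this at the top of your source code file:

    # -*- coding: utf_8 -*-

    If you don't have this declaration, python tries to parse your code as ascii by default, and so you get: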

    SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
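
The effect of the coding declaration can be checked without creating files, because `compile()` accepts source as bytes and honors the declaration. This is only a sketch for Python 3, where the default source encoding is utf-8 rather than ascii, so an undeclared latin-1 byte is still rejected; the helper name `compiles` is made up.

```python
def compiles(source_bytes):
    """Return True if the raw source bytes can be compiled as a module."""
    try:
        compile(source_bytes, "<demo>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

# The same latin-1 encoded source, with and without a coding declaration.
declared = b"# -*- coding: latin-1 -*-\ntest_str = '\xf3'\n"
undeclared = b"test_str = '\xf3'\n"

namespace = {}
exec(compile(declared, "<demo>", "exec"), namespace)

print(compiles(declared), compiles(undeclared))
print(namespace["test_str"] == u"\xf3")
```

Writing non-ASCII test strings with escapes such as `u"\xf3"` sidesteps the question entirely, since the source file then contains only ASCII.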
    
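    Once your program is working, or if you aren't using python's console or any other console to look at output, you will probably only care about #1 on the list. The system default and console encodings are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function, pasted at the bottom of this gigantic mess, that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: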

    '�' = original char <type 'str'> repr(char)='\xf3'
    '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
    'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
    '?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
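
The decoding half of that output can be reproduced with a few lines. This sketch is for Python 3 and only covers explicit decoding of the single byte 0xf3; the implicit `unicode(char)` case has no direct Python 3 equivalent. The helper `try_decode` is a made-up name.

```python
def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as error:
        return None, str(error)

char = b"\xf3"   # o-acute as a single Latin-1 byte
results = {}
for encoding in ("ascii", "latin_1", "utf_8"):
    results[encoding] = try_decode(char, encoding)
    text, error = results[encoding]
    if error is None:
        print(encoding, "-> U+%04X" % ord(text))
    else:
        print(encoding, "-> ERROR:", error)
```

Only latin_1 succeeds: ascii rejects any byte above 0x7f, and in utf_8 the byte 0xf3 announces a multi-byte sequence that never arrives.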
    

    Thanks a lot for the answers below, especially to @John Machin for answering so thoroughly.

    Of course there is. But your ...
    'ó' = original char <type 'str'> repr(char)='\xf3'
    'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
    'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
    '?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
    
    '�' = original char <type 'str'> repr(char)='\xf3'
    '�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
    '�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
    '?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
    
    #!/usr/bin/env python
    # -*- coding: utf_8 -*-
    
    import os
    import sys
    
    def encodingDemo(str):
        validStrings = ()
        try:        
            print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
            validStrings += ((str,""),)
        except UnicodeEncodeError as ude:
            print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
            print ude
        try:
            x = unicode(str)
            print "unicode(str) = ",x
            validStrings+= ((x, " decoded into unicode by the default system encoding"),)
        except UnicodeDecodeError as ude:
            print "ERROR.  unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
            print "\tThe system encoding is set to {0}.  See error:\n\t".format(sys.getdefaultencoding()),  
            print ude
        except UnicodeEncodeError as uee:
            print "ERROR.  Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
            print uee
        try:
            x = str.decode('latin_1')
            print "str.decode('latin_1') =",x
            validStrings+= ((x, " decoded with latin_1 into unicode"),)
            try:        
                print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
                validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
            except UnicodeDecodeError as ude:
                print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8.  See error:\n\t",
                print ude
        except UnicodeDecodeError as ude:
            print "Something didn't work, probably because the string wasn't latin_1 encoded.  See error:\n\t",
            print ude
        except UnicodeEncodeError as uee:
            print "ERROR.  Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
            print uee
        try:
            x = str.decode('utf_8')
            print "str.decode('utf_8') =",x
            validStrings+= ((x, " decoded with utf_8 into unicode"),)
            try:
                print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
                # Record the result only when the re-encoding actually succeeded.
                validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
            except UnicodeEncodeError as uee:
                # encode() fails with UnicodeEncodeError, not UnicodeDecodeError.
                print "str.decode('utf_8').encode('latin_1') didn't work.  The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1.  See error:\n\t",
                print uee
        except UnicodeDecodeError as ude:
            print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded.  See error:\n\t",
            print ude
        except UnicodeEncodeError as uee:
            print "ERROR.  Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",uee
    
        print
        print "Printing information about each character in the original string."
        for char in str:
            try:
                print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
            except UnicodeDecodeError as ude:
                print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
            except UnicodeEncodeError as uee:
                print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
                print uee    
    
            try:
                x = unicode(char)        
                print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
            except UnicodeDecodeError as ude:
                print "\t'?' = unicode(char) ERROR: {0}".format(ude)
            except UnicodeEncodeError as uee:
                print "\t'?' = unicode(char)  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
    
            try:
                x = char.decode('latin_1')
                print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
            except UnicodeDecodeError as ude:
                print "\t'?' = char.decode('latin_1')  ERROR: {0}".format(ude)
            except UnicodeEncodeError as uee:
                print "\t'?' = char.decode('latin_1')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
    
            try:
                x = char.decode('utf_8')
                print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
            except UnicodeDecodeError as ude:
                print "\t'?' = char.decode('utf_8')  ERROR: {0}".format(ude)
            except UnicodeEncodeError as uee:
                print "\t'?' = char.decode('utf_8')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
    
            print
    
    x = 'ó'
    encodingDemo(x)
    
    >>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
    Sigur Rós
    
    >>> oacute_latin1 = "\xF3"
    >>> oacute_unicode = oacute_latin1.decode('latin1')
    >>> oacute_utf8 = oacute_unicode.encode('utf8')
    >>> print repr(oacute_latin1)
    '\xf3'
    >>> print repr(oacute_unicode)
    u'\xf3'
    >>> import unicodedata
    >>> unicodedata.name(oacute_unicode)
    'LATIN SMALL LETTER O WITH ACUTE'
    >>> print repr(oacute_utf8)
    '\xc3\xb3'
    >>>
    
    >>> unicode("\xF3")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
    not in range(128)
    >>> "\xF3".decode('latin1')
    u'\xf3'
    >>>
    
    db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
    db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
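
For readers who want to try the repair above end to end, here is a self-contained sketch for Python 3 using an in-memory database. It assumes the damage consists of Latin-1 bytes sitting in a TEXT column, and it reuses the table, column, and function names from the snippet above; the `CAST(? AS TEXT)` insert is just a way to simulate the broken data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (TextColumn TEXT)")
# Simulate the damage: Latin-1 bytes stored as TEXT without any conversion.
conn.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# Look at the stored bytes without decoding them.
conn.text_factory = bytes
raw_before = conn.execute("SELECT TextColumn FROM TheTable").fetchone()[0]

# Re-decode each value as Latin-1; the function receives the BLOB as bytes
# and returns text, which sqlite3 writes back as UTF-8.
conn.create_function("FIXENCODING", 1, lambda blob: bytes(blob).decode("latin-1"))
conn.execute("UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

conn.text_factory = str
fixed = conn.execute("SELECT TextColumn FROM TheTable").fetchone()[0]
conn.close()

print(repr(raw_before))
print(fixed == u"Sigur R\xf3s")
```

Running the UPDATE twice would mangle rows that are already correct, so on a real database it is worth restricting it to the rows known to be broken and working on a backup copy.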
    
    conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
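
The `text_factory` hook receives the raw bytes of each TEXT value, so it can also apply a fallback instead of dropping bytes. The following sketch is for Python 3; the table `t`, the column `name`, and the helper `lenient_text` are hypothetical, and the insert again simulates Latin-1 bytes stored as TEXT.

```python
import sqlite3

def lenient_text(raw):
    """Decode as UTF-8 when possible, otherwise fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# Equivalent of the one-liner above: undecodable bytes are silently dropped.
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")
ignored = conn.execute("SELECT name FROM t").fetchone()[0]

# Alternative: keep the character by falling back to Latin-1.
conn.text_factory = lenient_text
lenient = conn.execute("SELECT name FROM t").fetchone()[0]
conn.close()

print(ignored)
print(lenient == u"Sigur R\xf3s")
```

With 'ignore' the accented letter disappears and the value reads 'Sigur Rs', which avoids the error but loses data; the fallback version keeps the letter at the cost of guessing the encoding.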
    
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    from __future__ import unicode_literals
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')