How can I use Python to decode this UTF-8 string, picked up from a random website and saved by the Django ORM?


python django encoding utf-8


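I parsed a file and saved its content in a database using Django. The website was 100% in English, so I naively assumed it was ASCII all along and happily saved the text as unicode.

You can guess the rest of the story :-)

When I print it, I get the usual encoding error: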

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)

A quick search told me that u'\u2019' is the UTF-8 representation of ’.

repr(string) shows me this:

"u'his son\\u2019s friend'"

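Then, of course, I tried django.utils.encoding.smart_str and the more direct approach of string.encode('utf-8'), and finally ended up with something printable. Unfortunately, it prints like this in my (Linux, UTF-8) terminal: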

In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"

In [77]: print string.encode('utf-8')
his son�s friend

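Not what I expected. I suspect I double-encoded something or missed an important point.

Of course, the file's original encoding is not shipped with the file. I guess I could read the HTTP headers or ask the webmaster, but since \u2019s looks like UTF-8, I assumed it was UTF-8. I may be very wrong about that; please tell me if I am.

A solution is obviously appreciated, but a deep explanation of the cause, and of how to avoid this happening again, matters more. I often get bitten by encodings, which shows I still have not fully mastered the subject.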
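The header check mentioned in the question could look roughly like the sketch below. It is written for Python 3 (the thread itself uses Python 2), the helper name `charset_from_content_type` and the sample header values are made up for illustration, and it only parses a header string rather than fetching anything from the network.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` when the server
    did not declare a charset.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset() or default


# Sample header values (assumptions, not taken from the real site).
declared = charset_from_content_type('text/html; charset=ISO-8859-1')
fallback = charset_from_content_type('text/html')
print(declared, fallback)

# Decode the raw bytes with the declared charset instead of guessing.
raw = b'his son\xe2\x80\x99s friend'
text = raw.decode(charset_from_content_type('text/html; charset=UTF-8'))
print(text == 'his son\u2019s friend')
```

Decoding once at the boundary with the declared charset, and keeping text as unicode everywhere else, is one common way to avoid guessing later.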

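Perhaps I am naive, but... isn't your problem just that the leading \ of the unicode code point has been escaped?

Your original string behaves like this: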

>>> s = u'his son\\u2019s friend'
>>> print(s)
his son\u2019s friend

But removing the escaping \ gives:

>>> s = u'his son\u2019s friend'
>>> print(s)
his son’s friend
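The difference between the two literals above can be checked directly. This is a small Python 3 sketch (the thread uses Python 2, where the same strings would carry a `u` prefix); the variable names are illustrative only.

```python
# A doubled backslash stores six literal characters: \ u 2 0 1 9.
escaped = 'his son\\u2019s friend'
# A single backslash stores one character, the code point U+2019.
real = 'his son\u2019s friend'

print(len(escaped), len(real))

# If text really did arrive with literal escapes, the unicode_escape
# codec can turn them back into code points. This is only safe here
# because the escaped text is pure ASCII.
recovered = escaped.encode('ascii').decode('unicode_escape')
print(recovered == real)
```

Note that seeing `\\u2019` in the output of `repr(repr(s))` does not by itself mean the stored string contains a literal backslash; comparing lengths, as above, is a quick way to tell the two cases apart.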

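You're good. You have the correct data. Yes, the original data was UTF-8 (judging by context, u2019 makes perfect sense as the apostrophe between "son" and "s"). The weird � error character probably just means the font your terminal is configured with has no glyph for that character (a fancy apostrophe). No big deal. The data is correct where it counts. If you're nervous, try a few different terminal/OS combinations (I used iTerm on OS X). I have spent a lot of time explaining to my QA folks that the scary � question-mark character just means they don't have Chinese fonts installed on their Windows box (in my case, we test with Chinese data). Here is some commentary: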

#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'

#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'

#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

# Encode in UTF-8 and now we have a string,
# which gets displayed as hex escapes.     
#Unicode code point 2019 looks like it gets 3 bytes in UTF-8 (yup, it does)
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'

#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
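For readers on Python 3, the same walk-through can be sketched as below. It separates the code point (an abstract number) from its UTF-8 bytes, and shows two standard error handlers for targets that cannot represent the character. The variable names are illustrative.

```python
import unicodedata

s1 = 'his son\u2019s friend'
ch = s1[7]

# The code point is an abstract number, independent of any encoding.
print(hex(ord(ch)))
print(unicodedata.name(ch))

# UTF-8 is one way to serialize that number: three bytes here.
data = s1.encode('utf-8')
print(data)
print(data.decode('utf-8') == s1)

# ASCII has no slot for U+2019, so strict encoding fails...
try:
    s1.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ...unless an error handler is chosen explicitly (both are lossy).
replaced = s1.encode('ascii', 'replace')
escaped = s1.encode('ascii', 'backslashreplace')
print(replaced, escaped)
```

The `replace` and `backslashreplace` handlers only change how the text is shown; the original string `s1` is left untouched.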


See also character 2019 (hex e28099; search the referenced page for "2019").

Try invoking the Python shell like this:

python2 -S -i -c 'import sys;sys.setdefaultencoding("utf-8");import site'

The default encoding is then utf-8, and the string should print fine.
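`sys.setdefaultencoding` does not exist in Python 3, so the command above applies to Python 2 only. As a different technique (not what this answer does), one could look at the encoding of the output stream and escape whatever it cannot represent. The sketch below assumes Python 3; `safe_for_stream` is a hypothetical helper name.

```python
import sys


def safe_for_stream(text, encoding):
    """Return text with characters the encoding lacks shown as escapes.

    Hypothetical helper: round-trips through the target encoding using
    the backslashreplace handler, so printing never raises.
    """
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


s = 'his son\u2019s friend'

# The stream may not report an encoding (for example when redirected
# to an in-memory buffer), so fall back to ASCII in that case.
stream_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(safe_for_stream(s, stream_encoding))

# What the helper produces for two specific encodings.
as_ascii = safe_for_stream(s, 'ascii')
as_utf8 = safe_for_stream(s, 'utf-8')
```

On a UTF-8 terminal the string passes through unchanged; on an ASCII-only stream the apostrophe is printed as the six characters `\u2019` instead of raising `UnicodeEncodeError`.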

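Programs like MS Word change quotes (and apostrophes) into non-ASCII values. A user probably copied and pasted the data.

It works for me. What terminal are you using? Your font may not have the right glyph. By the way, u2019 is the correct code point for the single quote, not its UTF-8 representation.

I'm using Gnome Terminal on Ubuntu. I did not escape the "\"; that is what shows up when using repr, because repr escapes everything so that you can copy the data from the terminal.

@e-satis - then the issue is the one Keith raised in the comments... What Python version / OS / installation are you using? Mine is vanilla Python 2.7.1 on an Ubuntu 11.04 64-bit machine.

Vanilla Python 2.7.1 on an Ubuntu 11.04 32-bit machine :-)

$ python2 -S -i -c 'import sys;sys.setdefaultencoding("utf-8");import site' >>> s = u'his son\u2019s friend' >>> print s.encode("utf-8") his son�s friend :-( Anyway, +1 for the Python snippet.

Thanks Peter. I have had a lot of trouble with encodings, and I just assumed I was doing something wrong :-)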
>>> s = u'his son\u2019s friend'
>>> print s.encode("utf-8")
his son’s friend