Python: using for to iterate through a UTF-8 string -> what is the data type/encoding of the iterator?


python encoding utf-8 character-encoding


I have a UTF-8 encoded string (mostly Chinese plus some English), and I want to count the characters in it (similar to a word count for English).

So I use

for letter in text:    # text is a utf-8 encoded str

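But I'm not sure what kind of "letter" I get. text prints fine to the screen and is written to the CSV file fine, but the letter from "for letter in text" looks broken both on screen and in the CSV file. I think this must be an encoding problem, but adding

.encode('utf-8')

here and there does not solve it, and returns errors like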

UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 0: ordinal not in range(128)

I mean, my code returns no error, but the letters all look broken; the error appears once I add .encode('utf-8') to the print,

letter.encode('utf-8')

or

wcwriter.writerows([[k.encode('utf-8'), v]])

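UTF-8 encoded bytes print fine to the screen and are written fine to a file, but only because your screen (terminal or console) and whatever reads the file understand UTF-8.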

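The UTF-8 encoding uses one or more bytes per code point. Iteration does not step through the data code point by code point, but byte by byte. So the character

å

is encoded to UTF-8 as two bytes, C3 and A5. Trying to handle those two bytes as letters causes problems: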

>>> 'å'
'\xc3\xa5'
>>> for byte in 'å':
...     print repr(byte)
... 
'\xc3'
'\xa5'
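
The same effect can be reproduced on Python 3, where bytes takes the place of Python 2's str. The sketch below is only an illustration; the sample text and the variable names are made up for the example.

```python
# Python 3 sketch: bytes here behaves like the Python 2 str in the question.
text = "å中a"                     # hypothetical sample: 3 characters
encoded = text.encode("utf-8")    # the same text as UTF-8 bytes

# Iterating over bytes yields one item per byte (ints in Python 3),
# so wrap each one back into a single-byte bytes object for display.
byte_items = [bytes([b]) for b in encoded]

# Iterating over the decoded text yields one item per code point.
char_items = list(encoded.decode("utf-8"))

print(len(byte_items), byte_items)
print(len(char_items), char_items)
```

The byte list is longer than the character list because å takes two bytes and 中 takes three, which is why individual items from the byte-wise loop do not display as readable letters.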

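You should decode to unicode values so that Python knows the code points that the bytes encode, or find the place where you already had Unicode and encoded it, and work with the Unicode value there instead.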
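
As a rough sketch of the decode-first approach applied to the character count from the question: the helper name count_characters and the sample data are hypothetical, skipping whitespace is just one possible choice, and the code is written for Python 3 (on Python 2.7 the decoding step would be text.decode('utf8')).

```python
from collections import Counter


def count_characters(raw, encoding="utf-8"):
    """Decode once, then count code points (hypothetical helper)."""
    # Accept either encoded bytes or already-decoded text.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # Whitespace is skipped here; drop the condition to count it too.
    return Counter(ch for ch in text if not ch.isspace())


raw = "中文 and 中文".encode("utf-8")   # made-up sample input
counts = count_characters(raw)
for char, n in counts.most_common():
    print(char, n)
```

Because the loop runs over decoded text, each key in the counter is a whole character rather than a fragment of its UTF-8 encoding.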
The exception comes from trying to encode bytes that are already encoded. Python tries to be helpful by first decoding the bytes to Unicode, so that it can comply and encode back to bytes, but it can only do so with the default ASCII codec. That is why you get a UnicodeDecodeError (note the Decode in the name) when you call encode():

>>> 'å'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Instead, encode only when you pass the value to the CSV writer:

for statuses in jres.get('statuses'):  # jres is a JSON response returned from an API call
    text = statuses['text']
    twwriter.writerows([[td, text.encode('utf8'), name, loc, city, province]])

You may want to study the difference between Unicode and encodings, and how it relates to Python; see the articles by Joel Spolsky and Ned Batchelder.

Even with decoded UTF-8, Python seems to split emoji and the like into multiple code points. I use the following function to work around this:

# ustr must be a "decoded" unicode string, e.g. u""
def each_utf8_char(ustr, pointer=0):
    ustr = ustr.encode('utf-8')
    slen = len(ustr)
    char = ustr[pointer] if slen > pointer else False
    while char:
        charVal = ord(char)
        # the lead byte tells how many bytes the sequence occupies
        if charVal < 128:
            nbytes = 1
        elif charVal < 224:
            nbytes = 2
        elif charVal < 240:
            nbytes = 3
        elif charVal < 248:
            nbytes = 4
        # 5- and 6-byte forms are obsolete; encode('utf-8') never produces them
        elif charVal < 252:
            nbytes = 5
        else:
            nbytes = 6
        yield ustr[pointer:pointer+nbytes].decode('utf-8')
        pointer += nbytes
        char = ustr[pointer] if slen > pointer else False

It is a generator, so you can use it in a for loop, for example over each_utf8_char(my_ustr).

What Python version are you using? In 2.x, str is a sequence of bytes and unicode is a sequence of characters, whereas in 3.x str is a sequence of characters and bytes is a sequence of bytes.

@TomHunt: This is clearly Python 2; note the print statement. Other clues: using 'rb' for CSV files, as the csv module recommends, and u'' string literals for Unicode data.

I think I'm working with Python 2.7.
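
For readers on Python 3, the lead-byte idea behind each_utf8_char can be sketched roughly as follows. This is not the answerer's code: the names utf8_sequence_length and iter_utf8_chars are hypothetical, the input is assumed to be well-formed UTF-8 bytes, and only the 1- to 4-byte forms that modern UTF-8 allows are handled.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte uses."""
    if lead_byte < 0x80:
        return 1                      # ASCII
    if 0xC2 <= lead_byte < 0xE0:
        return 2
    if 0xE0 <= lead_byte < 0xF0:
        return 3
    if 0xF0 <= lead_byte < 0xF5:
        return 4
    # Continuation bytes (0x80-0xBF) and unused values land here.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % lead_byte)


def iter_utf8_chars(data):
    """Yield one decoded character per UTF-8 sequence in the bytes data."""
    pos = 0
    while pos < len(data):
        n = utf8_sequence_length(data[pos])   # indexing bytes gives an int
        yield data[pos:pos + n].decode("utf-8")
        pos += n


# Sample: 'a', U+4E2D (a CJK character), U+1F600 (an emoji)
sample = "a\u4e2d\U0001F600".encode("utf-8")
chars = list(iter_utf8_chars(sample))
print(len(sample), len(chars))
```

On Python 3 this is mostly useful when the data arrives as raw bytes, since iterating over a decoded str already yields whole code points there.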