Python 2.7: how to get a random unicode string

python-2.7 encoding utf-8

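I am testing a REST-based service where one of the inputs is a text string, so I send it random unicode strings from my Python code. So far the unicode strings I sent were within the ASCII range, so everything worked.

Now I am trying to send characters outside the ASCII range, and I get an encoding error. Here is my code. I have gone through this, but still can't get my head around it.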

# coding=utf-8

import os, random, string
import json

junk_len = 512
junk =  (("%%0%dX" % junk_len) % random.getrandbits(junk_len * 8))

for i in xrange(1,5):
    if(len(junk) % 8 == 0):
        print u'decoding to hex'
        message = junk.decode("hex")

    print 'Hex chars %s' %message
    print u' '.join(message.encode("utf-8").strip())

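The first print runs without any problem, but I cannot send the string to the REST service without encoding it. So in the second print I try to encode it to utf-8. That is the line that fails, with the following message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 7: ordinal not in range(128)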

UTF-8 only allows certain bit patterns. You appear to be using UTF-8 in your code, so your bytes need to conform to the allowed UTF-8 patterns:

1 byte: 0b0xxxxxxx

2 byte: 0b110xxxxx 0b10xxxxxx

3 byte: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx

4 byte: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx

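In the multi-byte patterns, the first byte indicates the number of bytes in the whole pattern: leading 1s, followed by a 0 and the data bits x. The non-leading bytes all follow the same pattern, 0b10xxxxxx, with two leading indicator bits 10 and six data bits xxxxxx.

In general, randomly generated bytes will not follow these patterns. You can only generate the data bits x randomly.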
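
The patterns above can be sketched as a small hand-written encoder. This is only an illustration, not a replacement for Python's built-in codec: the names encode_code_point and random_utf8_bytes are hypothetical, and the sketch assumes the random data bits are chosen as a code point no greater than 0x10FFFF and outside the surrogate range D800-DFFF, since other bit combinations are not valid UTF-8 even when the indicator bits are right. It is written to run on both Python 2.7 and Python 3.

```python
import random

def encode_code_point(cp):
    """Pack one code point into UTF-8 bytes using the four bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % cp)
    if cp < 0x80:
        # 1 byte: 0b0xxxxxxx
        return bytearray([cp])
    if cp < 0x800:
        # 2 bytes: 0b110xxxxx 0b10xxxxxx
        return bytearray([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
        return bytearray([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    # 4 bytes: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
    return bytearray([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

def random_utf8_bytes(count):
    """Return valid UTF-8 for `count` random code points."""
    out = bytearray()
    while count > 0:
        cp = random.randint(0, 0x10FFFF)
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates and draw again
        out += encode_code_point(cp)
        count -= 1
    return out

junk = random_utf8_bytes(512)
text = junk.decode("utf-8")  # succeeds because every sequence is well formed
```

Only the x bits come from the random number; the indicator bits are fixed by the pattern, which is why the result always decodes.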

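As others have said, it is hard to generate valid random UTF-8 bytes directly, because the byte sequences have to be correct.

Since Unicode maps every character to a number between 0x0000 and 0x10FFFF, all you need to do is randomly generate a number in that range to get a valid Unicode address. Passing the random number to unichr (or chr on Py3) returns a Unicode string containing the character at that random code point.

Then all you need to do is have Python encode to UTF-8 to create a valid UTF-8 sequence.

Because the full Unicode range contains many gaps and unprintable characters (due to font limitations), using the range 0000-D7FF returns characters in the Basic Multilingual Plane, which are more likely to be printable on your system. When encoded to UTF-8, each character yields a sequence of at most 3 bytes.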
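
As a quick check of the 3-byte claim, the sketch below measures the encoded length at the boundaries of each pattern inside 0000-D7FF. The helper name utf8_length is hypothetical, and the unichr alias is only there so the snippet also runs on Python 3.

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns text

def utf8_length(code_point):
    """Number of bytes the code point occupies once encoded as UTF-8."""
    return len(unichr(code_point).encode("utf-8"))

# First and last code point of the 1-, 2- and 3-byte ranges (capped at D7FF).
boundaries = [0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xD7FF]
lengths = [utf8_length(cp) for cp in boundaries]
# lengths == [1, 1, 2, 2, 3, 3]
```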

Plain random

import random

def random_unicode(length):
    # Build a list of unicode characters in the range 0000-D7FF.
    # randrange excludes its stop value, so 0xD800 keeps 0xD7FF reachable.
    random_unicodes = [unichr(random.randrange(0xD800)) for _ in xrange(length)]
    return u"".join(random_unicodes)

my_random_unicode_str = random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')

Unique random

import random

def unique_random_unicode(length):
    # Draw `length` distinct code points from 0000-D7FF
    # (length must not exceed 0xD800, the size of the range).
    random_ints = random.sample(xrange(0xD800), length)

    # Convert each code point into its unicode character.
    random_unicodes = [unichr(x) for x in random_ints]
    # Join the characters into one unicode string.
    return u"".join(random_unicodes)

my_random_unicode_str = unique_random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
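
If 4-byte sequences are wanted as well, the same idea can be stretched over the whole range 0x0000-0x10FFFF. The following is a rough sketch rather than part of the original answer: to_char and random_unicode_full_range are hypothetical names, surrogates (D800-DFFF) are skipped because they cannot be encoded on their own, and the fallback branch assumes a narrow Python 2 build where unichr rejects code points above 0xFFFF.

```python
import random

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns text

def to_char(code_point):
    """Return the text for one code point."""
    try:
        return unichr(code_point)
    except ValueError:
        # Narrow Python 2 builds only: build the character from an escape,
        # which yields a surrogate pair that still encodes to 4 bytes.
        return ("\\U%08x" % code_point).decode("unicode-escape")

def random_unicode_full_range(length):
    """Random unicode string drawn from every encodable code point."""
    chars = []
    while len(chars) < length:
        code_point = random.randint(0, 0x10FFFF)
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogate: draw again
        chars.append(to_char(code_point))
    return u"".join(chars)

my_full_range_str = random_unicode_full_range(512)
my_full_range_utf_8 = my_full_range_str.encode("utf-8")
```

Most of this range lies outside the Basic Multilingual Plane, so expect many characters that your system cannot display.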

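Can you explain how this accounts for the 4 valid unicode patterns that @rossum gave? Will I get every possible utf-8 string? The explanation sounds convincing, but I don't see how sample/xrange produces only valid unicode characters, or all of them.

It doesn't account for 4-byte UTF-8, but that is not what the OP wanted. See the updated explanation of how it works.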