Python: what is a good way to set up ranges on a number line?


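I have some data like this, where the second field is the probability of the first field, so "0: 0.017" means that 0 has a probability of 0.017. All the probabilities sum to 1.

My question is: how do I lay out a "range line" from the probabilities, so that I can find the lower and upper bound for each character? So 0 would be [0, 0.017), [0.017, 0.022), and so on.

I am trying to implement arithmetic coding.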

(0: 0.017,
1: 0.022,
2: 0.033,
3: 0.033,
4: 0.029,
5: 0.028,
6: 0.035,
7: 0.032,
8: 0.028,
9: 0.027,
a: 0.019,
b: 0.022,
c: 0.029,
d: 0.03,
e: 0.028,
f: 0.035,
g: 0.026,
h: 0.037,
i: 0.029,
j: 0.025,
k: 0.025,
l: 0.037,
m: 0.025,
n: 0.023,
o: 0.026,
p: 0.035,
q: 0.033,
r: 0.031,
s: 0.023,
t: 0.022,
u: 0.038,
v: 0.022,
w: 0.016,
x: 0.026,
y: 0.021,
z: 0.033,)

Edit: never mind, I figured it out, I had just messed up some silly math... Thanks for all the input!

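Create a dictionary whose keys are your characters and whose values are pairs defining the lower and upper bounds:

prev_p = 0
bounds = {}
for line in open(a_file):
   # parse_the_line is a placeholder: it should return (character, probability)
   character, p = parse_the_line(line)
   # the upper bound is the running total, not the bare probability
   bounds[character] = (prev_p, prev_p + p)
   prev_p += p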
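
For anyone who wants to run the idea end to end, here is a minimal sketch. The `parse_the_line` body below is my own assumption about the input format (lines such as `a: 0.019,`), `build_bounds` is a name I made up, and the four sample lines are just the start of the table in the question.

```python
from itertools import accumulate

def parse_the_line(line):
    """Hypothetical parser for lines such as '(0: 0.017,' or 'z: 0.033,)'."""
    character, _, p = line.strip().strip('(),').partition(':')
    return character.strip(), float(p)

def build_bounds(lines):
    """Map each character to its (lower, upper) pair on the number line."""
    pairs = [parse_the_line(line) for line in lines if line.strip()]
    highs = list(accumulate(p for _, p in pairs))  # running totals
    lows = [0.0] + highs[:-1]                      # each low is the previous high
    return {c: (lo, hi) for (c, _), lo, hi in zip(pairs, lows, highs)}

# Truncated sample; use the whole file in practice.
sample = ["(0: 0.017,", "1: 0.022,", "2: 0.033,", "3: 0.033,)"]
bounds = build_bounds(sample)
for character, (low, high) in bounds.items():
    print(f"{character}: [{low:.3f}, {high:.3f})")
```

Each lower bound is simply the previous upper bound, so the intervals tile the line with no gaps or overlaps.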


Getting the data into Python is just an exercise:

>>> corpus = [('0', 0.017), ('1', 0.022), ('2', 0.033), ('3', 0.033), ('4', 0.029),
...           ('5', 0.028), ('6', 0.035), ('7', 0.032), ('8', 0.028), ('9', 0.027),
...           ('a', 0.019), ('b', 0.022), ('c', 0.029), ('d', 0.030), ('e', 0.028),
...           ('f', 0.035), ('g', 0.026), ('h', 0.037), ('i', 0.029), ('j', 0.025),
...           ('k', 0.025), ('l', 0.037), ('m', 0.025), ('n', 0.023), ('o', 0.026),
...           ('p', 0.035), ('q', 0.033), ('r', 0.031), ('s', 0.023), ('t', 0.022),
...           ('u', 0.038), ('v', 0.022), ('w', 0.016), ('x', 0.026), ('y', 0.021),
...           ('z', 0.033)]
Create the cumulative sums:

>>> distribution = []
>>> total = 0.0
>>> for letter, frequency in corpus:
...     distribution.append(total)
...     total += frequency
... 
Actually working with data like this is a prime job for the bisect module:

>>> import bisect, random
>>> def random_letter():
...     value = random.random()
...     index = bisect.bisect(distribution, value) - 1
...     return corpus[index][0]
... 
>>> [random_letter() for n in range(10)]  # doctest: +SKIP
['d', '6', 'p', 'c', '8', 'f', '7', 'm', 'z', '7']
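
The same lookup works in the other direction for arithmetic coding: given a number, find which character's interval contains it. Below is a rough sketch of that, not part of the answer above; `symbol_for`, `lows` and `highs` are names I introduced, and the corpus is truncated to four entries, so it only covers [0, 0.105).

```python
import bisect
from itertools import accumulate

# Truncated sample of the corpus; the real table has 36 entries summing to 1.
corpus = [('0', 0.017), ('1', 0.022), ('2', 0.033), ('3', 0.033)]

# highs[i] is the upper bound of corpus[i]; lows[i] is its lower bound.
highs = list(accumulate(p for _, p in corpus))
lows = [0.0] + highs[:-1]

def symbol_for(value):
    """Return (symbol, low, high) for the half-open interval containing value."""
    if not 0.0 <= value < highs[-1]:
        raise ValueError("value outside the covered range")
    # bisect_right counts the lower bounds <= value; the last of those wins.
    index = bisect.bisect_right(lows, value) - 1
    return corpus[index][0], lows[index], highs[index]

print(symbol_for(0.02))
```

With this sample, 0.02 lands in the interval of '1', and a value exactly on a boundary (such as 0.017) belongs to the interval that starts there.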

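From this point you can see how to work with your values. I commented the code (it handles the input file in a very simple way). my_list is now clean and well formatted, holding a string (the value) and a float (the frequency) for each entry. Hope this helps.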

Output of the code above:

[['0', 0.017], ['1', 0.022], ['2', 0.033], ['3', 0.033], 
['4', 0.029], ['5', 0.028], ['6', 0.035], ['7', 0.032], 
['8', 0.028], ['9', 0.027], ['a', 0.019], ['b', 0.022], 
['c', 0.029], ['d', 0.03], ['e', 0.028], ['f', 0.035], 
['g', 0.026], ['h', 0.037], ['i', 0.029], ['j', 0.025], 
['k', 0.025], ['l', 0.037], ['m', 0.025], ['n', 0.023], 
['o', 0.026], ['p', 0.035], ['q', 0.033], ['r', 0.031], 
['s', 0.023], ['t', 0.022], ['u', 0.038], ['v', 0.022], 
['w', 0.016], ['x', 0.026], ['y', 0.021], ['z', 0.033]]

Then:

# Took a page out of TokenMacGuy, credit to him
distribution = []
distribution.append(0.00)  # Pad the front with 0.00
total = 0.0 # Create a float here

for entry in my_list:
    distribution.append(entry[1])
    total += entry[1] # Running sum of the frequencies
    total = round(total, 3) # Rounding to 3 decimal places

distribution.append(1.00) # Missing the 1.00 value
print(distribution) # Print to check

The output is:

[0.0, 0.017, 0.022, 0.033, 0.033, 0.029, 0.028, 0.035, 0.032, 
0.028, 0.027, 0.019, 0.022, 0.029, 0.03, 0.028, 0.035, 0.026, 
0.037, 0.029, 0.025, 0.025, 0.037, 0.025, 0.023, 0.026, 0.035, 
0.033, 0.031, 0.023, 0.022, 0.038, 0.022, 0.016, 0.026, 0.021, 
0.033, 1.0]
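
Finally, to print the end result: nothing fancy, I used pattern and format to make it look nicer. The calculation basically follows ninjagecko's method. I did need to pad the distribution with 0.00 and 1.00, because the calculation does not produce them on its own. The implementation is fairly straightforward once you work out how to do the probability calculation.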

pattern = '{0}: [{1:1.3f}, {2:1.3f})'
count = 1 # a counter to keep track of the index

pre_p = distribution[0] 
p = distribution[1]

# Here we will print it out at the end in the format you said in the question
for entry in my_list:
    print(pattern.format(entry[0], pre_p, p))
    pre_p += distribution[count]
    p += distribution[count+1]
    count = count + 1

Output:

0: [0.000, 0.017)
1: [0.017, 0.039)
2: [0.039, 0.072)
3: [0.072, 0.105)
4: [0.105, 0.134)
5: [0.134, 0.162)
6: [0.162, 0.197)
7: [0.197, 0.229)
8: [0.229, 0.257)
9: [0.257, 0.284)
a: [0.284, 0.303)
b: [0.303, 0.325)
c: [0.325, 0.354)
d: [0.354, 0.384)
e: [0.384, 0.412)
f: [0.412, 0.447)
g: [0.447, 0.473)
h: [0.473, 0.510)
i: [0.510, 0.539)
j: [0.539, 0.564)
k: [0.564, 0.589)
l: [0.589, 0.626)
m: [0.626, 0.651)
n: [0.651, 0.674)
o: [0.674, 0.700)
p: [0.700, 0.735)
q: [0.735, 0.768)
r: [0.768, 0.799)
s: [0.799, 0.822)
t: [0.822, 0.844)
u: [0.844, 0.882)
v: [0.882, 0.904)
w: [0.904, 0.920)
x: [0.920, 0.946)
y: [0.946, 0.967)
z: [0.967, 1.000)
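
Since the goal is arithmetic coding, here is a rough sketch of how such a table might be used afterwards. It is my own illustration rather than part of the code above: `exact_bounds` and `narrow` are hypothetical names, the list is truncated to four entries, and `Fraction` is swapped in for `float` so that the bounds stay exact instead of relying on rounding.

```python
from fractions import Fraction

# Truncated, hypothetical sample in the same shape as my_list.
my_list = [['0', 0.017], ['1', 0.022], ['2', 0.033], ['3', 0.033]]

def exact_bounds(entries):
    """Return {symbol: (low, high)} using exact rational arithmetic."""
    bounds = {}
    low = Fraction(0)
    for symbol, p in entries:
        high = low + Fraction(str(p))  # str() keeps the decimal as written
        bounds[symbol] = (low, high)
        low = high
    return bounds

def narrow(message, bounds):
    """Shrink [0, 1) once per symbol, as an arithmetic encoder would."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        sym_low, sym_high = bounds[symbol]
        low += width * sym_low          # move to the symbol's slice
        width *= sym_high - sym_low     # and shrink to its size
    return low, low + width

bounds = exact_bounds(my_list)
low, high = narrow('10', bounds)
print(low, high)
```

Any number inside the final interval identifies the message '10' under this table. With the full 36-entry table the per-symbol intervals cover all of [0, 1), which is what a real encoder needs.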

Full source here:


Is your data in a text file? Or is this some kind of data structure?

@George It needs to be a data structure; I get the probabilities from a text file of random characters/digits.

"0 would be [0, 0.017), [0.017, 0.022)" - do you mean "0 would be [0, 0.017), 1 would be [0.017, 0.017+0.022), 2 would be [0.017+0.022, 0.017+0.022+0.033)"?

Don't you need to sort the characters by relative frequency before you try to describe the cumulative probabilities?

@George Yes, I just went with that approach.