Fastest way to measure the compression ratio of a string in Python 3

string python-3.x

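I want to estimate the Kolmogorov complexity of short strings (about one word long) by compressing them with LZMA and computing the compression ratio.

What is the most efficient way to do this in Python 3?

Edit:

I am not sure this is a good way to estimate the complexity of short strings, because to compute the Kolmogorov (K-) complexity of a string properly we have to account for the length of the program used to decompress it. The length of that program (67k for xz 5.1.0 on my Debian laptop) would overwhelm short strings. The following program therefore comes closer to computing an upper bound on the K-complexity: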

import lzma #For python 2.7 use backports.lzma

program_length = 67000

def lzma_compression_ratio(test_string):
    bytes_in = bytes(test_string,'utf-8')
    bytes_out = lzma.compress(bytes_in)
    lbi = len(bytes_in)
    lbo = len(bytes_out)+program_length
    ratio = lbo/lbi
    message = '%d bytes compressed to %d bytes, ratio %0.3f'%(lbi,lbo,ratio)
    print(message)
    return ratio

test_string = 'a man, a plan, a canal: panama'
lzma_compression_ratio(test_string)

for n in range(22,25):
    test_string = 'a'*(2**n)
    lzma_compression_ratio(test_string)

The output below shows a compression ratio of over 2000 for a string of length 30, and a ratio below 0.01 for a repetitive string of length 2^23. These are technically correct upper bounds on K-complexity, but they are clearly useless for short strings. The program print('a'*30) has length 13, which already gives a far smaller upper bound (13 bytes) for the string consisting of 30 a's.

30 bytes compressed to 67024 bytes, ratio 2234.133
4194304 bytes compressed to 67395 bytes, ratio 0.016
8388608 bytes compressed to 68005 bytes, ratio 0.008
16777216 bytes compressed to 69225 bytes, ratio 0.004
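
Apart from the 67000-byte program length, part of what inflates the figure for short strings is the fixed .xz container (stream header, index, footer, integrity check). A minimal sketch, assuming a raw LZMA2 stream is an acceptable measure when short strings are only compared with each other; the helper name compressed_sizes and the choice of preset 6 are illustrative and not from the original post:

```python
import lzma

# Assumption: a raw LZMA2 stream (no .xz container) is good enough for
# comparing short strings with each other. Preset 6 is an arbitrary choice.
RAW_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

def compressed_sizes(text):
    """Return (input_len, xz_len, raw_len) in bytes for a UTF-8 encoded string."""
    data = text.encode("utf-8")
    # Default FORMAT_XZ output includes the container overhead.
    xz_len = len(lzma.compress(data))
    # FORMAT_RAW omits the container, so the filter chain must be given explicitly.
    raw_len = len(lzma.compress(data, format=lzma.FORMAT_RAW, filters=RAW_FILTERS))
    return len(data), xz_len, raw_len

for sample in ("a man, a plan, a canal: panama", "a" * 30):
    n_in, n_xz, n_raw = compressed_sizes(sample)
    print("%d bytes in, %d bytes as .xz, %d bytes raw" % (n_in, n_xz, n_raw))
```

A raw stream can only be decompressed by a caller that supplies the same filter chain, so this trades self-description for a smaller constant overhead.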

Original answer:

@Superbest, this seems to work, but I don't know whether it is the most efficient:

import lzma

def lzma_compression_ratio(test_string):
    c = lzma.LZMACompressor()
    bytes_in = bytes(test_string,'utf-8')
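    # compress() may keep data in an internal buffer; flush() returns the rest of the stream
    bytes_out = c.compress(bytes_in) + c.flush()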
    return len(bytes_out)/len(bytes_in)

test_string = 'a man, a plan, a canal: panama'
compression_ratio = lzma_compression_ratio(test_string)
print(compression_ratio)
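
To check which variant is actually faster, both can be timed with timeit. This is a rough sketch rather than a benchmark: the function names and the repeat count of 200 are illustrative assumptions, and timings will vary by machine. Note that compress() on an LZMACompressor returns only the bytes emitted so far, so flush() has to be appended to get the whole stream:

```python
import lzma
import timeit

def ratio_one_shot(text):
    """Compression ratio using the module-level convenience function."""
    data = text.encode("utf-8")
    return len(lzma.compress(data)) / len(data)

def ratio_incremental(text):
    """Compression ratio using an explicit compressor object."""
    data = text.encode("utf-8")
    c = lzma.LZMACompressor()
    out = c.compress(data) + c.flush()  # flush() emits the data still buffered
    return len(out) / len(data)

sample = "a man, a plan, a canal: panama"
runs = 200  # arbitrary repeat count for the timing loop
for fn in (ratio_one_shot, ratio_incremental):
    seconds = timeit.timeit(lambda: fn(sample), number=runs)
    print("%s: ratio %.3f, %.1f us per call"
          % (fn.__name__, fn(sample), seconds / runs * 1e6))
```

Both functions should report the same ratio, so any difference is purely in call overhead.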

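Why do you wrap len(...) in float? If I repeat the last 3 lines (effectively asking it to compute the same string twice), it gives 0.8 the first time and 0.0 the second.

@Superbest, the float is there because I am a Python 2.7 user. The question is tagged python-3.x.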

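You forgot to include the size of the LZMA module, which is the program that actually performs the compression/decompression :p

Umm, you do realize that compressing short strings will increase the number of bits needed to specify them, since you need to include the decompression algorithm? From Wikipedia:

just compress the string s with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the length of the resulting string

@RishavKundu The question is about implementation, not theory.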
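
The Wikipedia recipe quoted above can be sketched directly in Python by building the text of a program that prints the string and measuring that text. This is only an illustration under a loud assumption: the lzma module is treated as part of the language and its size is not counted, which is exactly the point the comments dispute. The function names are hypothetical:

```python
import lzma

def literal_program(text):
    # Trivial "decompressor": a program that prints the string literal
    # (print also emits a trailing newline).
    return "print(%r)" % text

def lzma_program(text):
    # Decompressor stub concatenated with the compressed payload.
    # Assumption: the lzma module itself is NOT counted in the length.
    payload = lzma.compress(text.encode("utf-8"))
    return "import lzma;print(lzma.decompress(%r).decode())" % payload

def complexity_upper_bound(text):
    """Length in bytes of the shorter of the two candidate programs."""
    candidates = (literal_program(text), lzma_program(text))
    return min(len(p.encode("utf-8")) for p in candidates)

for sample in ("a man, a plan, a canal: panama", "a" * 100000):
    print(len(sample), complexity_upper_bound(sample))
```

Taking the minimum over several candidate programs keeps the bound valid while letting the literal program win for short strings, where compression only adds overhead.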