Compressing a set of large integers

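I have a set of integers for which I would like the most compact representation. I have the following constraints/characteristics:

  • It is a set, or in other words, a list of unique integers in which the order does not matter
  • The size L of the set is relatively small (typically 1000 elements)
  • The integers follow a uniform distribution between 0 and N-1, where N is relatively large (say 2^32)
  • Access to the elements of the compressed set is random, but it is fine if the decompression procedure is not that fast
  • The compression should be lossless, obviously
I have tried a few things, but I am not satisfied with the results, and I am convinced that a better solution exists:

  • Delta encoding (sort, then encode the differences), or sort, then encode the difference between the i-th element and i*N/L. Both give reasonable results, but not great ones, probably because of the typical sizes of N and L. Huffman-coding the deltas does not help because they are usually large
  • Recursive range reduction. It looks smart, but it works best on exponentially decreasing integers, which is definitely not the case here
  • A few discussions here on stackoverflow look similar to my problem, but are not quite the same
I would be glad to hear any ideas you may have. Thanks in advance!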

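Update:

It turns out that delta encoding appears to be close to the optimal solution. This may be different for other distributions of the elements in the set.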
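The delta-encoding approach mentioned in the question could be sketched roughly as follows. The question does not say how the differences were stored, so the base-128 varint format here is an assumption, and the function names `delta_varint_encode` and `delta_varint_decode` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort vals[0..n-1] in place, then write the gap from each value to the
   previous one (the first value is written as is) as a base-128 varint:
   7 payload bits per byte, high bit set on every byte except the last.
   The varint format is an assumption, not something the question states.
   out must have room for 5 * n bytes.  Returns the bytes written. */
static size_t delta_varint_encode(uint32_t *vals, size_t n, uint8_t *out)
{
    size_t len = 0;
    uint32_t prev = 0;

    assert(vals != NULL && out != NULL);
    qsort(vals, n, sizeof(*vals), cmp_u32);
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = vals[i] - prev;
        prev = vals[i];
        while (delta >= 0x80) {
            out[len++] = (uint8_t)(delta | 0x80);
            delta >>= 7;
        }
        out[len++] = (uint8_t)delta;
    }
    return len;
}

/* Inverse: read n varints from in[] and rebuild the sorted values. */
static void delta_varint_decode(const uint8_t *in, size_t n, uint32_t *vals)
{
    uint32_t prev = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t delta = 0;
        int shift = 0;
        while (*in & 0x80) {
            delta |= (uint32_t)(*in++ & 0x7f) << shift;
            shift += 7;
        }
        delta |= (uint32_t)*in++ << shift;
        prev += delta;
        vals[i] = prev;
    }
}
```

With L = 1000 and N = 2^32, a typical gap is around N/L, about 4.3 million or 22 bits, so most varints in this sketch would take three or four bytes. That is in line with the "reasonable, but not great" result reported in the question.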

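If your integers are random, uncorrelated, and really follow a uniform distribution over [0, 2^32 - 1], then it can probably be shown that you cannot compress the array below its trivial representation. Am I missing something in your question?

For non-random arrays of numbers, I usually use a simple algorithm, deflate. It is commonly used because it works well on general, not entirely random arrays. Good libraries with adjustable compression levels exist in all major languages, which is of course another advantage.

I use deflate to compress small arrays of physical sensor measurements (about 300 to 2000 32-bit integers) and get a gain of 70%, but that is because consecutive sensor measurements rarely differ much.

Finding a better algorithm that fits all situations is probably not easy. Most improvements come from the particular properties of your number sequence.

You may also notice that you get a better compression gain by compressing several sets together. Of course, this may be very inconvenient depending on your application.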

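You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway...)

ceiling(log(Combination(2^32, 1000)) / (8 * log(2))) = 2934

So if, as you state, the choices are uniformly distributed, then the best compression you can hope for on average for that specific case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes.

Combination(2^32, 1000)
is simply the total number of possible inputs to the compression algorithm. If those inputs are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table.
ceiling(log(Combination(2^32, 1000)) / log(2))
is how many bits you need for that index integer.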
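For readers who want to see where 2934 comes from, the count can be expanded as below. This is a rough derivation with rounded intermediate values, not part of the original answer; the correction from replacing each log2(2^32 - i) by 32 is far below one bit and is ignored.

```latex
\begin{align*}
\log_2 \binom{2^{32}}{1000}
  &= \sum_{i=0}^{999} \log_2\!\left(2^{32} - i\right) - \log_2(1000!) \\
  &\approx 32000 - 8529.4 \\
  &\approx 23470.6 \text{ bits}
\end{align*}
\[
\left\lceil 23470.6 \right\rceil = 23471 \text{ bits}, \qquad
\left\lceil 23471 / 8 \right\rceil = 2934 \text{ bytes}, \qquad
2934 / 4000 = 73.35\%
\]
```

The 8529.4 bits subtracted are what is saved by not having to record the order of the 1000 elements.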

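Update:

I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and then subtract one from each delta (since the delta between successive distinct elements is at least one). Then the trick is that I write out all of the high bytes, followed by the next most significant bytes, and so on. The high bytes of the deltas minus one tend to be zero, so this groups many zeros together, which the standard compression utilities love. The next set of bytes also tends to be biased toward low values.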
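The byte reordering described above could look something like the following sketch. The names `delta_planes_encode` and `delta_planes_decode` are hypothetical, and storing the first value unchanged is an assumption, since the answer does not say how the first element is handled. The output would then be fed to a general-purpose compressor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert sorted, distinct values into "delta minus one" and store them
   plane by plane: first the most significant byte of every delta, then
   the next byte of every delta, and so on.  out must hold 4 * n bytes.
   Assumption: the first value is stored as is. */
static void delta_planes_encode(const uint32_t *sorted, size_t n,
                                uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t d;

        assert(i == 0 || sorted[i] > sorted[i - 1]);
        d = i == 0 ? sorted[0] : sorted[i] - sorted[i - 1] - 1;
        for (size_t plane = 0; plane < 4; plane++)
            out[plane * n + i] = (uint8_t)(d >> (24 - 8 * plane));
    }
}

/* Inverse: gather the four bytes of each delta back from the planes and
   rebuild the sorted values. */
static void delta_planes_decode(const uint8_t *in, size_t n,
                                uint32_t *sorted)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t d = 0;

        for (size_t plane = 0; plane < 4; plane++)
            d = (d << 8) | in[plane * n + i];
        sorted[i] = i == 0 ? d : sorted[i - 1] + 1 + d;
    }
}
```

Grouping the bytes by significance changes nothing about the information content; it only puts the mostly-zero high bytes next to each other so that a compressor such as gzip or xz can exploit them.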

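For the example (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when running the result through gzip -9, and 3098 bytes through xz -9 (xz uses the same compression, LZMA, as 7zip). These are quite close to the theoretical best average of 2934. In addition, gzip has an overhead of 18 bytes and xz an overhead of 24 bytes, in both cases for headers and trailers. So a fairer comparison with the theoretical best would be 3092 bytes for gzip -9 and 3074 bytes for xz -9, which is about 5% larger than the theoretical best.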

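Update 2:

I implemented direct encoding of the combination, which achieved an average of 2974 bytes, only a little over 1% more than the theoretical best. I used the GNU multiple precision arithmetic library (GMP) to encode the index of each combination as one very large integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it may not be obvious from the name what arithmetic operation they perform.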

#include <assert.h>
#include <gmp.h>

/* "local" marks functions that are private to this file */
#define local static

/* Recursively code the members in set[] between low and high (low and high
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degenerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* rem = unpack % (max - set[0]), unpack /= (max - set[0]) */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}
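The code above packs a sequence of "digits", each with its own radix, into one big number. As an illustration of just that mechanism, here is a toy version that uses a 64-bit integer in place of `mpz_t`. The names `struct packer`, `put_digit` and `get_digit` are hypothetical, and overflow is not handled, so it only works while the product of the radices fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Toy mixed-radix packer: a 64-bit stand-in for the (pack, base) pair of
   mpz_t values used by the real encoder.  Start with pack = 0, base = 1. */
struct packer {
    uint64_t pack;   /* value accumulated so far */
    uint64_t base;   /* product of the radices used so far */
};

/* Append one digit, 0 <= digit < radix.  This mirrors the pair of calls
   mpz_addmul_ui(pack, base, digit) and mpz_mul_ui(base, base, radix). */
static void put_digit(struct packer *p, uint64_t digit, uint64_t radix)
{
    assert(digit < radix);
    p->pack += p->base * digit;
    p->base *= radix;
}

/* Extract the next digit, given the same radix that was used to pack it.
   This mirrors rem = mpz_fdiv_q_ui(unpack, unpack, radix): the remainder
   is the digit and the quotient is what is left to decode. */
static uint64_t get_digit(uint64_t *pack, uint64_t radix)
{
    uint64_t digit = *pack % radix;

    *pack /= radix;
    return digit;
}
```

Digits come back out in the order they were put in, provided the decoder supplies the same radices. In the combination coder each radix is derived from members that have already been decoded (for example set[high] - set[low] - 1), which is why no radix needs to be stored.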