Can Python zlib compressed output avoid using certain byte values?

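It seems that the output of zlib.compress uses all possible byte values. Is it possible to restrict it to 255 of the 256 byte values, for example to avoid \n?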

Note that I am only using the Python manual as a reference. The question is not specific to Python; it applies to any language that has a zlib library.

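No, this is not possible. Apart from the compressed data itself, the stream contains standardized control structures that hold integers. Those integers can cause any 8-bit value to end up in the bytestream.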
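
As a quick illustration of that claim (not part of the original answer), the sketch below compresses reproducible pseudo-random input and collects the byte values that appear in the output. The seed and the input size are arbitrary choices made for this example.

```python
import random
import zlib

# Deterministic pseudo-random input so the experiment is reproducible.
# The seed (0) and the size (64 KiB) are arbitrary assumptions.
rng = random.Random(0)
n = 1 << 16
payload = rng.getrandbits(8 * n).to_bytes(n, "little")

compressed = zlib.compress(payload)
distinct_values = set(compressed)

print("distinct byte values in output:", len(distinct_values))
print("newline byte present:", 0x0A in distinct_values)
```

Incompressible input is used because the output then looks random too, so every byte value, including \n, is very likely to show up.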

Your only option is to encode the zlib bytestream into another format, for example base64.
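
A minimal sketch of that approach, using Python's standard base64 and zlib modules. The helper names compress_to_text and decompress_from_text are made up for this example.

```python
import base64
import zlib

def compress_to_text(data: bytes) -> bytes:
    """Compress, then base64-encode so the result contains no newline byte."""
    # b64encode (unlike base64.encodebytes) does not insert line breaks.
    return base64.b64encode(zlib.compress(data))

def decompress_from_text(text: bytes) -> bytes:
    """Reverse compress_to_text."""
    return zlib.decompress(base64.b64decode(text))

sample = b"line one\nline two\n" * 100
wrapped = compress_to_text(sample)
print(b"\n" in wrapped)
print(decompress_from_text(wrapped) == sample)
```

The cost is that base64 emits four output bytes for every three input bytes, so the compressed stream grows by about a third.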

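Since this is not possible within zlib itself, and you mentioned that base64 encoding is too inefficient, note that it is quite easy to encode the bytes you want to avoid (such as newline) with an escape character: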

def encode(data):
    # order matters
    return data.replace(b'a', b'aa').replace(b'\n', b'ab')

def decode(data):
    def _foo():
        pair = False
        for b in data:
            if pair:
                # yield b'a' if b==b'a' else b'\n'
                yield 97 if b==97 else 10
                pair = False
            elif b==97:  # b'a'
                pair = True
            else:
                yield b
    return bytes(_foo())

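This is not the most efficient code in the world, and you might want to use the least-used byte as the escape to save some more space, but it is readable enough and demonstrates the point. You can encode and decode losslessly, and the encoded stream will not contain any newlines.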
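
The "least-used byte" idea could look roughly like the sketch below. It is not from the original answer: the function names, the choice to store the escape byte as the first byte of the output, and the marker byte are all assumptions made for illustration.

```python
from collections import Counter

NEWLINE = 0x0A

def encode_min_escape(data: bytes) -> bytes:
    counts = Counter(data)
    # Use the rarest byte value (other than newline) as the escape prefix.
    esc = min((b for b in range(256) if b != NEWLINE), key=lambda b: counts[b])
    # Second byte of the newline escape: any value that is neither esc nor newline.
    mark = 0x00 if esc != 0x00 else 0x01
    e = bytes([esc])
    # Order matters: double the escape byte first, then replace newlines.
    body = data.replace(e, e + e).replace(b"\n", e + bytes([mark]))
    # Hypothetical framing: the escape byte is prepended so the decoder can find it.
    return e + body

def decode_min_escape(blob: bytes) -> bytes:
    esc = blob[0]
    out = bytearray()
    pending = False
    for b in blob[1:]:
        if pending:
            # esc esc -> esc; esc followed by anything else -> newline.
            out.append(esc if b == esc else NEWLINE)
            pending = False
        elif b == esc:
            pending = True
        else:
            out.append(b)
    return bytes(out)
```

This costs one extra byte up front, and it needs the whole input in memory to count byte frequencies, so it suits the one-shot zlib.compress case rather than streaming.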

As some measure of confidence, you can check this exhaustively on small bytestrings:

from itertools import combinations_with_replacement, permutations

all(
    bytes(p) == decode(encode(bytes(p)))
        for c in combinations_with_replacement(b'ab\nc', r=6)
        for p in permutations(c)
)

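The whole purpose of compression is to reduce the size as much as possible. If zlib, or any compressor, used only 255 of the 256 byte values, the size of the output would increase by at least 0.07%.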
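
The 0.07% figure follows from information content: a symbol drawn from 255 values carries log2(255) bits instead of 8. A short check of that arithmetic (the variable names are just for this example):

```python
import math

# Each output symbol limited to 255 values carries at most log2(255) bits.
bits_per_symbol = math.log2(255)

# Lower bound on the relative growth when 8-bit bytes are re-expressed
# in a 255-value alphabet.
min_expansion = 8 / bits_per_symbol - 1

print(f"{min_expansion:.4%}")
```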

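That may be perfectly fine for you, in which case you can simply post-process the compressed output, or any data at all, to remove one particular byte value at the cost of some expansion. The simplest approach is to replace that byte, wherever it occurs, with a two-byte escape sequence. You then also need to replace the escape prefix itself with a different two-byte escape sequence. That expands the data by 0.8% on average. This is exactly what Hans provided in another answer here.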
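
The 0.8% average comes from two of the 256 byte values each costing one extra byte, that is 2/256. A rough empirical check on pseudo-random data, assuming the escape prefix is the byte 'a' as in the escaping answer (seed and size are arbitrary):

```python
import random

# Reproducible pseudo-random stand-in for compressed data.
rng = random.Random(1)
n = 1 << 20
data = rng.getrandbits(8 * n).to_bytes(n, "little")

ESC = ord("a")  # escape prefix, assumed to match the escaping answer
LF = 0x0A       # the byte value being avoided

# Every occurrence of ESC or LF becomes two bytes, so each adds one byte.
escaped_len = len(data) + data.count(bytes([ESC])) + data.count(bytes([LF]))
expansion = escaped_len / len(data) - 1
expected = 2 / 256

print(f"measured {expansion:.3%}, expected about {expected:.3%}")
```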

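If that cost is too high, you can do something more sophisticated, which is to decode a fixed Huffman code that encodes 255 symbols of equal probability. To get the original data back, you then encode with that Huffman code. The input is a sequence of bits, not bytes, and most of the time you will need to pad the input with some zero bits in order to encode the last symbol. The Huffman code turns one symbol into seven bits and the other 254 symbols into eight bits. Going the other way, it will therefore expand the input by a little less than 0.1%. For short messages it will be a little more, since often fewer than seven bits at the very end will be encoded into a symbol.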
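
The "a little less than 0.1%" figure can be worked out exactly, assuming the input bits are uniformly random (as compressed data roughly is). Seven zero bits in a row occur with probability 1/128, and only then does an output symbol consume seven input bits instead of eight:

```python
from fractions import Fraction

# Probability that the next seven input bits are all zero (uniform random bits).
p_short = Fraction(1, 128)

# Average number of input bits consumed per output symbol.
avg_bits = p_short * 7 + (1 - p_short) * 8

# Each output symbol is written as one 8-bit byte.
expansion = Fraction(8) / avg_bits - 1

print(expansion, float(expansion))
```

Under that assumption the expansion is exactly 1/1023, about 0.098%.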

An implementation in C:

// Placed in the public domain by Mark Adler, 26 June 2020.

// Encode an arbitrary stream of bytes into a stream of symbols limited to 255
// values. In particular, avoid the \n (10) byte value. With -d, decode back to
// the original byte stream. Take input from stdin, and write output to stdout.

#include <stdio.h>
#include <string.h>

// Encode arbitrary bytes to a sequence of 255 symbols, which are written out
// as bytes that exclude the value '\n' (10). This encoding is actually a
// decoding of a fixed Huffman code of 255 symbols of equal probability. The
// output will be on average a little less than 0.1% larger than the input,
// plus one byte, assuming random input. This is intended to be used on
// compressed data, which will appear random. An input of all zero bits will
// have the maximum possible expansion, which is 14.3%, plus one byte.
int nolf_encode(FILE *in, FILE *out) {
    unsigned buf = 0;
    int bits = 0, ch;
    do {
        if (bits < 8) {
            ch = getc(in);
            if (ch != EOF) {
                buf |= (unsigned)ch << bits;
                bits += 8;
            }
            else if (bits == 0)
                break;
        }
        if ((buf & 0x7f) == 0) {
            buf >>= 7;
            bits -= 7;
            putc(0, out);
            continue;
        }
        int sym = buf & 0xff;
        buf >>= 8;
        bits -= 8;
        if (sym >= '\n' && sym < 128)
            sym++;
        putc(sym, out);
    } while (ch != EOF);
    return 0;
}

// Decode a sequence of symbols from a set of 255 that was encoded by
// nolf_encode(). The input is read as bytes that exclude the value '\n' (10).
// Any such values in the input are ignored and flagged in an error message.
// The sequence is decoded to the original sequence of arbitrary bytes. The
// decoding is actually an encoding of a fixed Huffman code of 255 symbols of
// equal probability.
int nolf_decode(FILE *in, FILE *out) {
    unsigned long lfs = 0;
    unsigned buf = 0;
    int bits = 0, ch;
    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            lfs++;
            continue;
        }
        if (ch == 0) {
            if (bits == 0) {
                bits = 7;
                continue;
            }
            bits--;
        }
        else {
            if (ch > '\n' && ch <= 128)
                ch--;
            buf |= (unsigned)ch << bits;
        }
        putc(buf, out);
        buf >>= 8;
    }
    if (lfs)
        fprintf(stderr, "nolf: %lu unexpected line feeds ignored\n", lfs);
    return lfs != 0;
}

// Encode (no arguments) or decode (-d) from stdin to stdout.
int main(int argc, char **argv) {
    if (argc == 1)
        return nolf_encode(stdin, stdout);
    else if (argc == 2 && strcmp(argv[1], "-d") == 0)
        return nolf_decode(stdin, stdout);
    fputs("nolf: unknown options (use -d to decode)\n", stderr);
    return 1;
}
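
Since the question was asked in terms of Python, here is a sketch of a Python port of the two C routines above. It is a line-by-line translation for illustration, working on in-memory bytes instead of streams, and it silently skips stray line feeds where the C decoder counts and reports them.

```python
def nolf_encode(data: bytes) -> bytes:
    """Re-express bytes as symbols from a 255-value set that excludes 10."""
    out = bytearray()
    buf = 0      # bit buffer, least significant bit first
    bits = 0     # number of valid bits in buf
    pos = 0
    while True:
        at_end = False
        if bits < 8:
            if pos < len(data):
                buf |= data[pos] << bits
                bits += 8
                pos += 1
            elif bits == 0:
                break
            else:
                at_end = True   # flush the remaining bits, zero-padded
        if buf & 0x7F == 0:
            # Seven zero bits map to the single short symbol, written as 0.
            buf >>= 7
            bits -= 7
            out.append(0)
        else:
            sym = buf & 0xFF
            buf >>= 8
            bits -= 8
            if 10 <= sym < 128:
                sym += 1        # shift 10..127 up to 11..128 to skip '\n'
            out.append(sym)
        if at_end:
            break
    return bytes(out)


def nolf_decode(data: bytes) -> bytes:
    """Invert nolf_encode(); line feeds in the input are ignored."""
    out = bytearray()
    buf = 0
    bits = 0     # number of pending bits not yet written out
    for ch in data:
        if ch == 10:
            continue
        if ch == 0:
            if bits == 0:
                bits = 7        # seven zero bits pending, nothing to emit yet
                continue
            bits -= 1           # seven bits in, eight bits out
        else:
            if 10 < ch <= 128:
                ch -= 1
            buf |= ch << bits
        out.append(buf & 0xFF)
        buf >>= 8
    return bytes(out)
```

Typical use would be nolf_encode(zlib.compress(payload)) on the sending side and zlib.decompress(nolf_decode(blob)) on the receiving side.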

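base64 defeats the original purpose of compressing the stream. Is there a way to convert a byte stream with 256 possible values into one with 255 possible values, so that I can reserve "\n" for my own purposes?

Zlib compression does not use all possible "characters"; it uses all possible 8-bit byte values, i.e. 0-255. Technically it should be possible to implement your own similar compression scheme that avoids particular values, but it would not be interchangeable with standard zlib compression. Python's zip library is implemented in Python. The source is there, so you could create your own "user1424739lib compression".

You could replace any newlines in the compressed data with some kind of escape sequence. For example, replace newlines with X1 and actual Xs with X2, then reverse those replacements on the receiving end before decompressing. This is the same basic idea as programming languages letting you include a quote in a quoted string literal by putting a backslash in front of it. It inevitably undoes some of the compression: 1/128 on average, and up to a factor of 2 if the compressed data happens to consist entirely of bytes that need escaping.

I think you could convert the output to the range 0-254, but not easily skip a specific value in the 0-255 range. Would that be acceptable?

@martineau OK. That seems like a reasonable workaround. Could you provide a Python implementation that converts the result of zlib.compress and converts it back?

Does the decoding work correctly? What if the original input contains ab?

Yes. You get aab, which decodes back to ab. The code could return an error if a is followed by anything other than a or b, but it is liberal and returns \n for a followed by anything other than a.

@MarkAdler The decode function did have a bug, which I fixed in an edit; ab was a failing example. That is a good point, though, about decode trying to give some answer even for invalid input.

Ah, OK. I did not notice that the comment predated the edit.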