Java: Removing runs of consecutive zeros from a byte array

java arrays regex


For example, suppose I want to remove from an array every run of consecutive 0 bytes that is longer than 3 bytes:

byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(Arrays.toString(r));

Result:

{1,2,3,0,1,2,3,4}

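I'd like to do something like a regular expression in Java, but on a byte array rather than a String.

Is there something built in that can help me (or a good third-party tool), or do I need to work from scratch?

Strings are UTF-16, so converting back and forth isn't a good idea? At the very least it wastes a lot of overhead... right?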

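Java regular expressions operate on CharSequences. Could you wrap your existing byte array in a CharSequence (you may need to convert it to a char[]) and run the regular expression against that?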

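I don't think a regex will help much with what you want. One thing you could do is run-length encode the byte array, replace every occurrence of "30" (read: three 0s) with the empty string, and decode the resulting string. Wikipedia has a simple Java implementation.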
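
The encode, drop, decode idea above can be sketched roughly as follows. This is not the Wikipedia implementation the answer refers to; it is a minimal illustration that keeps the runs as {value, count} pairs instead of a string, and the class name `ZeroRunRle`, its method names, and the `minRun` threshold parameter are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: run-length encode, drop long zero runs, decode.
final class ZeroRunRle {

    // Encode the input as a list of {value, count} pairs, one per run.
    static List<int[]> encode(byte[] in) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int j = i;
            while (j < in.length && in[j] == in[i]) {
                j++;
            }
            runs.add(new int[] { in[i], j - i });
            i = j;
        }
        return runs;
    }

    // Decode the runs again, skipping zero runs of at least minRun bytes.
    static byte[] decodeWithoutLongZeroRuns(List<int[]> runs, int minRun) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            if (run[0] == 0 && run[1] >= minRun) {
                continue; // this is a run we want to drop
            }
            for (int k = 0; k < run[1]; k++) {
                out.write(run[0]); // write(int) stores only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    static byte[] removeZeroRuns(byte[] in, int minRun) {
        return decodeWithoutLongZeroRuns(encode(in), minRun);
    }

    public static void main(String[] args) {
        byte[] a = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
        System.out.println(java.util.Arrays.toString(removeZeroRuns(a, 4)));
    }
}
```

Keeping the runs as pairs sidesteps the ambiguity a textual encoding such as "30" would have when the data itself contains the byte values 3 and 0.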

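Regex isn't the tool for this job; you will need to implement it from scratch instead.

Although a reasonable library exists, I have not seen anyone implement a general regexp library on top of it.

I'd recommend solving your problem directly rather than implementing a regexp library :)

If you do convert to a String and back, you may not find any existing encoding that round-trips your 0 bytes. If so, you would have to write your own byte array to String converters; not worth the trouble.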

byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
String s0 = new String(a, "ISO-8859-1");
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes("ISO-8859-1");

System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]

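I used ISO-8859-1 (Latin-1) because, unlike other encodings, every byte in the range `0x00..0xFF` maps to a valid character, and each of those characters has the same numeric value as its Latin-1 encoding.

That means the string has the same length as the original byte array, you can match any byte by its numeric value with the `\xFF` construct, and you can convert the resulting string back to a byte array without losing information.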
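
The round-trip property claimed above is easy to check for yourself. The following is a small sketch, with the class name `Latin1RoundTrip` and its method names chosen only for illustration: it decodes all 256 byte values with a given charset, re-encodes them, and compares the result with the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for checking which charsets preserve arbitrary bytes.
final class Latin1RoundTrip {

    // True if decoding every byte value and re-encoding yields the same bytes.
    static boolean roundTrips(Charset charset) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, charset);
        return Arrays.equals(all, decoded.getBytes(charset));
    }

    // True if Latin-1 decoding gives one char per byte with the same numeric value.
    static boolean latin1PreservesValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        if (decoded.length() != all.length) {
            return false;
        }
        for (int i = 0; i < all.length; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round-trips: " + roundTrips(StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round-trips:      " + roundTrips(StandardCharsets.UTF_8));
        System.out.println("Latin-1 keeps values:   " + latin1PreservesValues());
    }
}
```

UTF-8 fails the check because byte values that do not form valid UTF-8 sequences are replaced during decoding, so the original bytes cannot be recovered.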

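I wouldn't try to display the data while it is in string form, though: all of the characters are valid, but many of them are not printable. Also, avoid manipulating the data while it is in string form; you might accidentally perform some escape-sequence substitution or other encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked.

Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.

I'd suggest converting the byte array to a String, running the regex, and then converting it back. Here is a working example: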

public void testRegex() throws Exception {
    byte a[] = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
    String s = btoa(a);
    String t = s.replaceAll("\u0000{4,}", "");
    byte b[] = atob(t);
    System.out.println(Arrays.toString(b));
}

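// Narrow each char back to a byte; every char is in 0x00..0xFF, so nothing is lost.
private byte[] atob(String t) {
    char[] array = t.toCharArray();
    byte[] b = new byte[array.length];
    for (int i = 0; i < array.length; i++) {
        b[i] = (byte) array[i];
    }
    return b;
}

// Map each byte to the char with the same unsigned value (0x00..0xFF).
private String btoa(byte[] a) {
    StringBuilder sb = new StringBuilder(a.length);
    for (byte b : a) {
        // Mask with 0xFF: Character.toChars(b) would throw for negative bytes.
        sb.append((char) (b & 0xFF));
    }
    return sb.toString();
}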

For more complicated transformations I'd suggest using a lexer. Both JavaCC and ANTLR support parsing/transforming binary files.

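Although I doubt a regex is the right tool for this job, if you do want to use one I'd suggest simply implementing a CharSequence wrapper around the byte array. Something like this (I just wrote this straight in, without compiling it... but you get the idea).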
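
The answer's own code did not survive here, so what follows is only a sketch of what such a wrapper might look like, not the original author's version. The class name `ByteCharSequence` and the `removeZeroRuns` helper are assumptions made for illustration; each byte is exposed as the char with the same unsigned value, so the regex engine can read the array in place without copying it into a String.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical read-only CharSequence view over a byte array.
final class ByteCharSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    private ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Mask so that bytes 0x80..0xFF become chars 0x80..0xFF, not negative values.
        return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new ByteCharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Latin-1 maps every byte to the char with the same value.
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    // Use the wrapper to find runs of four or more zero bytes and copy everything else.
    static byte[] removeZeroRuns(byte[] in) {
        Matcher m = Pattern.compile("\\x00{4,}").matcher(new ByteCharSequence(in));
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int last = 0;
        while (m.find()) {
            out.write(in, last, m.start() - last); // bytes before the match
            last = m.end();                        // skip the zero run itself
        }
        out.write(in, last, in.length - last);     // tail after the last match
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
        System.out.println(java.util.Arrays.toString(removeZeroRuns(a)));
    }
}
```

Because the match positions are indices into the original array, the output is assembled directly from the input bytes and no String conversion is needed on the way back.
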
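
The regex-based implementations proposed in the other answers are 8 times slower than a naive implementation that uses a loop to copy bytes from the input array to an output array:
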
/**
 * Remove four or more zero byte sequences from the input array.
 *  
 * @param inBytes the input array 
 * @return a new array with runs of four or more zero bytes removed from the input array
 */
private static byte[] removeDuplicates(byte[] inBytes) {
    int size = inBytes.length;
    // Use an array with the same size in the first place
    byte[] newBytes = new byte[size];
    byte value;
    int newIdx = 0;
    int zeroCounter = 0;

    for (int i = 0; i < size; i++) {
        value = inBytes[i];

        if (value == 0) {
            zeroCounter++;
        } else {
            if (zeroCounter >= 4) {
                // Rewind output buffer index
                newIdx -= zeroCounter;
            }

            zeroCounter = 0;
        }

        newBytes[newIdx] = value;
        newIdx++;
    }

    if (zeroCounter >= 4) {
        // Rewind output buffer index for four zero bytes at the end too
        newIdx -= zeroCounter;
    }

    // Copy data into an array that has the correct length
    byte[] finalOut = new byte[newIdx];
    System.arraycopy(newBytes, 0, finalOut, 0, newIdx);

    return finalOut;
}

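This implementation copies the input array byte by byte. When a run of zeros is detected, the output array index is decreased (rewound). After the input array has been processed, the output array is copied once more to trim it to the actual number of bytes, because the intermediate output array is allocated with the length of the input array.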

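Interestingly, a second approach, which tries to avoid unnecessary copying by going back to the first zero byte of a run and copying the zeros only when there are three or fewer of them, is slightly slower than the first approach.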
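
The code for that second approach is not shown in the answer. The sketch below is one plausible reading of the description, written for illustration only; the class name `ZeroRunSkipper` is hypothetical, and no claim is made that it matches the benchmarked implementation or its timings.

```java
import java.util.Arrays;

// Hypothetical variant: measure each zero run first, copy it only if it is short.
final class ZeroRunSkipper {

    static byte[] removeZeroRuns(byte[] in) {
        byte[] out = new byte[in.length];
        int outIdx = 0;
        int i = 0;

        while (i < in.length) {
            if (in[i] != 0) {
                // Non-zero bytes are always copied.
                out[outIdx++] = in[i++];
                continue;
            }

            // Remember where the zero run starts, then scan to its end.
            int runStart = i;
            while (i < in.length && in[i] == 0) {
                i++;
            }
            int runLength = i - runStart;

            if (runLength < 4) {
                // Three or fewer zeros: keep them.
                System.arraycopy(in, runStart, out, outIdx, runLength);
                outIdx += runLength;
            }
            // Four or more zeros: nothing is written, so nothing has to be rewound.
        }

        // Trim the buffer to the number of bytes actually written.
        return Arrays.copyOf(out, outIdx);
    }

    public static void main(String[] args) {
        byte[] a = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
        System.out.println(Arrays.toString(removeZeroRuns(a)));
    }
}
```

Compared with the rewinding loop, this version never writes zeros it later discards, at the cost of an inner scan per run.
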
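
All three implementations were tested on a Pentium N3700 processor, with 1000 iterations over an 8 x 32KB input array containing several zero runs of varying count and length. Compared with the regular-expression approach, the worst-case performance improvement was ...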