Java: find the index of a byte array within another byte array

java search

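Given a byte array, how can I find the position of a (smaller) byte array within it?

ArrayUtils looked promising, but if I am correct, it only lets me search the array for a single byte.

(I don't see how it matters, but just in case: sometimes the search byte array will be regular ASCII characters, at other times control characters or extended ASCII characters. So using String operations would not always be appropriate.)

The large array could be between 10 and 10,000 bytes, and the smaller one around 10 bytes. In some cases I will have several smaller arrays that I want to find in the larger array in a single search. I will also sometimes want the last index of an occurrence rather than the first.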

The simplest way is to compare each element:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}
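The question also asks for the last occurrence rather than the first. The same element-by-element comparison can be run from the right-hand end; the following is only a minimal sketch, and the names LastIndexOfSketch and lastIndexOf are made up for illustration.

```java
class LastIndexOfSketch {
    /** Returns the start index of the last occurrence of needle in haystack, or -1. */
    static int lastIndexOf(byte[] haystack, byte[] needle) {
        // Try candidate start positions from the right-most possible one down to 0.
        for (int i = haystack.length - needle.length; i >= 0; --i) {
            boolean found = true;
            for (int j = 0; j < needle.length; ++j) {
                if (haystack[i + j] != needle[j]) {
                    found = false;
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] big = {1, 2, 3, 1, 2, 3, 4};
        byte[] small = {1, 2, 3};
        System.out.println(lastIndexOf(big, small)); // prints 3
    }
}
```

An empty needle returns haystack.length here, which mirrors what String.lastIndexOf("") does.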

Regarding your updated question: Java strings are UTF-16 strings and do not care about the extended ASCII set, so you could use string.indexOf().

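Java strings are composed of 16-bit chars, not 8-bit bytes. A char can hold a byte, so you can always turn your byte arrays into strings and use indexOf: ASCII characters, control characters, and even zero bytes will work fine, provided the bytes are decoded with a single-byte charset such as ISO-8859-1, so that every byte maps to exactly one char.
Here is a demo:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
// StandardCharsets is java.nio.charset.StandardCharsets.
// ISO_8859_1 maps each byte to one char; UTF_8 would replace invalid byte sequences.
String bigStr = new String(big, StandardCharsets.ISO_8859_1);
String smallStr = new String(small, StandardCharsets.ISO_8859_1);
System.out.println(bigStr.indexOf(smallStr)); // prints 7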
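The charset passed to new String(...) matters for this approach once bytes above 0x7F are involved. The sketch below is only an illustration (the class and method names are made up): it runs the same string-based search with two charsets on bytes that are not valid UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetIndexOfSketch {
    /** Decodes both arrays with the given charset and searches with String.indexOf. */
    static int indexOfVia(Charset cs, byte[] big, byte[] small) {
        return new String(big, cs).indexOf(new String(small, cs));
    }

    public static void main(String[] args) {
        byte[] big = {1, (byte) 0xFF, 2};
        byte[] small = {(byte) 0xFE}; // this byte does not occur in big

        // ISO-8859-1 maps every byte to exactly one char, so distinct bytes stay distinct.
        System.out.println(indexOfVia(StandardCharsets.ISO_8859_1, big, small)); // prints -1

        // 0xFF and 0xFE are both invalid in UTF-8 and decode to the replacement
        // character U+FFFD, so this search reports a match that is not really there.
        System.out.println(indexOfVia(StandardCharsets.UTF_8, big, small));
    }
}
```

With a single-byte charset the char index returned by indexOf is also the byte index, which is what makes the string trick usable for binary data at all.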


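However, considering that your large array can be up to 10,000 bytes while the small array is only about ten bytes, this solution may not be the most efficient, for two reasons:

It requires copying the large array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
Java's string search algorithm is not the fastest one available. Implementing one of the more advanced string search algorithms could cut the execution time by up to a factor of ten (the length of the small string), and would require additional memory proportional to the length of the small string rather than the large one.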
Is this what you are looking for?
public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}
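To make the failure function more concrete, the sketch below reuses the computeFailure logic from the listing above and prints the table for one sample pattern. The class name FailureTableSketch and the pattern "ababc" are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FailureTableSketch {
    // Same logic as computeFailure in the KPM listing: failure[i] is the length of the
    // longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
    static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];
        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1]; // fall back to the next shorter border
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }
        return failure;
    }

    public static void main(String[] args) {
        byte[] pattern = "ababc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(computeFailure(pattern))); // prints [0, 0, 1, 2, 0]
    }
}
```

Reading the table: after matching "abab" and then failing on the next byte, the search can resume with two pattern bytes ("ab") already matched instead of starting over.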

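To save you time in testing: the KPM code works if you make computeFailure() static; that version is posted with a small Test class.

Google's Guava provides Bytes.indexOf(byte[] array, byte[] target).
With a library you can also find the index of a byte[] within a byte[]; a sample is available on GitHub.
Copied almost verbatim from java.lang.String's indexOf(char[], int, int, char[], int, int, int).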

Using the Knuth-Morris-Pratt algorithm is the most efficient way.
StreamSearcher is an implementation of it and is part of Twitter's elephant-bird project.
Including the whole library is not recommended, since it is rather large when only a single class is needed.
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice. Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Morris-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

public class KPM {
    /**
     * Search the data byte array for the first occurrence
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
        int[] failure = computeFailure(pattern);

        int j = 0;

        for (int i = 0; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes()) + " in [" + s + "]");
      }

    }
}

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
 }

indexList=[6, 25]
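If pulling in a library is not an option, the same list of positions can be sketched with the JDK alone. The class and method names below are hypothetical and not part of finbin. It is a naive scan, not an optimized search; it reports overlapping matches, and it treats an empty needle as having no matches, which is an assumption made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AllIndexesSketch {
    /** Returns the start index of every occurrence of needle in haystack, in ascending order. */
    static List<Integer> indexesOf(byte[] haystack, byte[] needle) {
        List<Integer> result = new ArrayList<>();
        if (needle.length == 0) {
            return result; // assumption: an empty needle yields no matches
        }
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = "Hello world.It's a small world.".getBytes(StandardCharsets.UTF_8);
        byte[] search = "world".getBytes(StandardCharsets.UTF_8);
        System.out.println("indexList=" + indexesOf(src, search)); // prints indexList=[6, 25]
    }
}
```

The first element of the list is the first index and the last element is the last index, so one pass covers both cases from the question.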

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

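    /**
     * Copies bytes from is to os until the delimiter "--" + boundary is seen,
     * keeping the most recent bytes in a ring buffer as long as the delimiter.
     * Returns true if the stream was processed without an IOException (whether or
     * not the delimiter was found), false otherwise. If the stream ends without
     * the delimiter, the bytes still held in the buffer are not written.
     */
    private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)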
    {
        try
        {
            String n = "--"+boundary;
            byte[] bc = n.getBytes("UTF-8");
            int s = bc.length;
            byte[] b = new byte[s];
            int p = 0;
            long l = 0;
            int c;
            boolean r;
            while ((c = is.read()) != -1)
            {
                b[p] = (byte) c;
                l += 1;
                p = (int) (l % s);
                if (l>p)
                {
                    r = true;
                    for (int i = 0; i < s; i++)
                    {
                        if (b[(p + i) % s] != bc[i])
                        {
                            r = false;
                            break;
                        }
                    }
                    if (r)
                        break;
                    os.write(b[p]);
                }
            }
            os.flush();
            return true;
        } catch(IOException e) {e.printStackTrace();}
        return false;
    }

// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).

public class KMPSearch {

    public static int indexOf(byte[] haystack, byte[] needle)
    {
        // needle is null or empty
        if (needle == null || needle.length == 0)
            return 0;

        // haystack is null, or haystack's length is less than that of needle
        if (haystack == null || needle.length > haystack.length)
            return -1;

        // pre construct failure array for needle pattern
        int[] failure = new int[needle.length];
        int n = needle.length;
        failure[0] = -1;
        for (int j = 1; j < n; j++)
        {
            int i = failure[j - 1];
            while ((needle[j] != needle[i + 1]) && i >= 0)
                i = failure[i];
            if (needle[j] == needle[i + 1])
                failure[j] = i + 1;
            else
                failure[j] = -1;
        }

        // find match
        int i = 0, j = 0;
        int haystackLen = haystack.length;
        int needleLen = needle.length;
        while (i < haystackLen && j < needleLen)
        {
            if (haystack[i] == needle[j])
            {
                i++;
                j++;
            }
            else if (j == 0)
                i++;
            else
                j = failure[j - 1] + 1;
        }
        return ((j == needleLen) ? (i - needleLen) : -1);
    }
}



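import java.util.Random;

// Assumption: JUnit 5 (Jupiter). The original listing omitted its test-framework imports.
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;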

class KMPSearchTest {
    private static Random random = new Random();
    private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    @Test
    public void testEmpty() {
        test("", "");
        test("", "ab");
    }

    @Test
    public void testOneChar() {
        test("a", "a");
        test("a", "b");
    }

    @Test
    public void testRepeat() {
        test("aaa", "aaaaa");
        test("aaa", "abaaba");
        test("abab", "abacababc");
        test("abab", "babacaba");
    }

    @Test
    public void testPartialRepeat() {
        test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
        test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
    }

    @Test
    public void testRandomly() {
        for (int i = 0; i < 1000; i++) {
            String pattern = randomPattern();
            for (int j = 0; j < 100; j++)
                test(pattern, randomText(pattern));
        }
    }

    /* Helper functions */
    private static String randomPattern() {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(10) + 1;
        for (int i = 0; i < steps; i++) {
            if (sb.length() == 0 || random.nextBoolean()) {  // Add literal
                int len = random.nextInt(5) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            } else {  // Repeat prefix
                int len = random.nextInt(sb.length()) + 1;
                int reps = random.nextInt(3) + 1;
                if (sb.length() + len * reps > 1000)
                    break;
                for (int j = 0; j < reps; j++)
                    sb.append(sb.substring(0, len));
            }
        }
        return sb.toString();
    }

    private static String randomText(String pattern) {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(100);
        for (int i = 0; i < steps && sb.length() < 10000; i++) {
            if (random.nextDouble() < 0.7) {  // Add prefix of pattern
                int len = random.nextInt(pattern.length()) + 1;
                sb.append(pattern.substring(0, len));
            } else {  // Add literal
                int len = random.nextInt(30) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
        }
        return sb.toString();
    }

    private static void test(String pattern, String text) {
        try {
            assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
        } catch (AssertionError e) {
            System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
        }
    }
}