Java 在另一个字节数组中查找一个字节数组的索引
给定一个字节数组,如何在其中找到(较小)字节数组的位置 使用Java 在另一个字节数组中查找一个字节数组的索引,java,search,bytearray,Java,Search,Bytearray,给定一个字节数组,如何在其中找到(较小)字节数组的位置 使用ArrayUtils看起来很有希望,但如果我是正确的,它只允许我在数组中查找要搜索的单个字节 (我看不出这有什么关系,但只是以防万一:有时搜索字节数组是常规ASCII字符,有时是控制字符或扩展ASCII字符。因此使用字符串操作并不总是合适的) 大的数组可能在10到10000字节之间,小的数组大约在10字节左右。在某些情况下,我将有几个较小的数组,我希望在一次搜索中在较大的数组中找到它们。我有时会想找到实例的最后一个索引,而不是第一个。最
ArrayUtils
看起来很有希望,但如果我是正确的,它只允许我在数组中查找要搜索的单个字节
(我看不出这有什么关系,但只是以防万一:有时搜索字节数组是常规ASCII字符,有时是控制字符或扩展ASCII字符。因此使用字符串操作并不总是合适的)
大的数组可能在10到10000字节之间,小的数组大约在10字节左右。在某些情况下,我将有几个较小的数组,我希望在一次搜索中在较大的数组中找到它们。我有时会想找到实例的最后一个索引,而不是第一个。最简单的方法是比较每个元素:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
当您更新您的问题时:Java字符串是UTF-16字符串,它们不关心扩展的ASCII集,因此您可以使用string.indexOf()Java字符串由16位
字符组成,而不是由8位字节组成。char
可以容纳byte
,因此您可以始终将字节数组设置为字符串,并使用indexOf
:ASCII字符、控制字符甚至零字符都可以正常工作
下面是一个演示:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));
但是,考虑到您的大阵列最多可以有10000个字节,而小阵列只有10个字节,此解决方案可能不是最有效的,原因有两个:
- 它需要将大数组复制到两倍大的数组中(容量相同,但使用
char
而不是byte
)。这将使您的内存需求增加三倍
- Java的字符串搜索算法不是最快的。如果您实现一种高级算法,例如。这可能会将执行速度降低十倍(小字符串的长度),并且需要与小字符串的长度成比例的额外内存,而不是与大字符串的长度成比例
这就是你要找的吗
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* @param data
* @param start First index in data
* @param stop Last index in data so that stop-start = length
* @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* @return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
公共类KPM{
/**
*在数据字节数组中搜索给定边界内第一次出现的字节数组模式。
*@param数据
*@param启动数据中的第一个索引
*@param停止数据中的最后一个索引,以便停止开始=长度
*@param pattern正在搜索的内容。“*”可以用作“任意字符”的通配符
*@返回
*/
公共静态int indexOf(字节[]数据、int开始、int停止、字节[]模式){
if(data==null | | pattern==null)返回-1;
int[]故障=计算故障(模式);
int j=0;
for(int i=开始;i<停止;i++){
而(j>0&(模式[j]!='*'&&pattern[j]!=data[i])){
j=故障[j-1];
}
如果(模式[j]='*'| |模式[j]==数据[i]){
j++;
}
if(j==模式长度){
返回i-模式长度+1;
}
}
返回-1;
}
/**
*使用引导过程计算故障函数,
*模式与自身相匹配。
*/
私有静态int[]计算失败(字节[]模式){
int[]失败=新的int[pattern.length];
int j=0;
for(int i=1;i0&&pattern[j]!=pattern[i]){
j=故障[j-1];
}
if(模式[j]==模式[i]){
j++;
}
失效[i]=j;
}
返回失败;
}
}
为节省测试时间:
为您提供使computeFailure()为静态的代码:
Google的Guava提供了Bytes.indexOf(byte[]数组,byte[]目标)
因此,你可以在byte[]中找到byte[]的索引
Github上的示例位于:从复制的内容几乎相同
indexOf(char[],int,int,char[]int,int,int)
静态int indexOf(字节[]源、int源偏移量、int源计数、字节[]目标、int目标偏移量、int目标计数、int fromIndex){
if(fromIndex>=sourceCount){
返回(targetCount==0?sourceCount:-1);
}
如果(从索引<0){
fromIndex=0;
}
如果(targetCount==0){
从索引返回;
}
字节第一=目标[targetOffset];
int max=sourceOffset+(sourceCount-targetCount);
对于(int i=sourceOffset+fromIndex;i,使用是最有效的方法
是它的一个实现,是Twitter的大象鸟项目的一部分
不建议包含此库,因为仅使用一个类就相当大
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
@SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* @return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}
import java.io.IOException;
导入java.io.InputStream;
导入java.util.array;
/**
*基于Knuth-Morris-Pratt算法的高效流搜索类。
*有关算法工作原理的更多信息,请参阅:http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
公共类流搜索器
{
专用字节[]模式;
私有int[]边界;
//用于搜索的模式长度上限。对于较长的模式,结果未定义。
@抑制警告(“未使用”)
公共静态最终int MAX_PATTERN_LENGTH=1024;
StreamSearcher(字节[]模式)
{
设置模式(模式);
}
/**
*设置此StreamSearcher要使用的新模式。
*
*@param pattern StreamSearcher将在以后的搜索调用中查找的模式(…)
*/
公共void setPattern(字节[]模式)
{
pattern=数组.copyOf(pattern,pattern.length);
边框=新整数[图案长度+1];
预处理();
}
/**
*搜索流中模式的下一个匹配项,
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}
package org.example;
import java.util.List;
import org.riversun.finbin.BinarySearcher;
public class Sample2 {
public static void main(String[] args) throws Exception {
BinarySearcher bs = new BinarySearcher();
// UTF-8 without BOM
byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
indexList=[6, 25]
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
@SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* @return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}
private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
{
try
{
String n = "--"+boundary;
byte[] bc = n.getBytes("UTF-8");
int s = bc.length;
byte[] b = new byte[s];
int p = 0;
long l = 0;
int c;
boolean r;
while ((c = is.read()) != -1)
{
b[p] = (byte) c;
l += 1;
p = (int) (l % s);
if (l>p)
{
r = true;
for (int i = 0; i < s; i++)
{
if (b[(p + i) % s] != bc[i])
{
r = false;
break;
}
}
if (r)
break;
os.write(b[p]);
}
}
os.flush();
return true;
} catch(IOException e) {e.printStackTrace();}
return false;
}
// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).
public class KMPSearch {
public static int indexOf(byte[] haystack, byte[] needle)
{
// needle is null or empty
if (needle == null || needle.length == 0)
return 0;
// haystack is null, or haystack's length is less than that of needle
if (haystack == null || needle.length > haystack.length)
return -1;
// pre construct failure array for needle pattern
int[] failure = new int[needle.length];
int n = needle.length;
failure[0] = -1;
for (int j = 1; j < n; j++)
{
int i = failure[j - 1];
while ((needle[j] != needle[i + 1]) && i >= 0)
i = failure[i];
if (needle[j] == needle[i + 1])
failure[j] = i + 1;
else
failure[j] = -1;
}
// find match
int i = 0, j = 0;
int haystackLen = haystack.length;
int needleLen = needle.length;
while (i < haystackLen && j < needleLen)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
else if (j == 0)
i++;
else
j = failure[j - 1] + 1;
}
return ((j == needleLen) ? (i - needleLen) : -1);
}
}
import java.util.Random;
class KMPSearchTest {
private static Random random = new Random();
private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
@Test
public void testEmpty() {
test("", "");
test("", "ab");
}
@Test
public void testOneChar() {
test("a", "a");
test("a", "b");
}
@Test
public void testRepeat() {
test("aaa", "aaaaa");
test("aaa", "abaaba");
test("abab", "abacababc");
test("abab", "babacaba");
}
@Test
public void testPartialRepeat() {
test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
}
@Test
public void testRandomly() {
for (int i = 0; i < 1000; i++) {
String pattern = randomPattern();
for (int j = 0; j < 100; j++)
test(pattern, randomText(pattern));
}
}
/* Helper functions */
private static String randomPattern() {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(10) + 1;
for (int i = 0; i < steps; i++) {
if (sb.length() == 0 || random.nextBoolean()) { // Add literal
int len = random.nextInt(5) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
} else { // Repeat prefix
int len = random.nextInt(sb.length()) + 1;
int reps = random.nextInt(3) + 1;
if (sb.length() + len * reps > 1000)
break;
for (int j = 0; j < reps; j++)
sb.append(sb.substring(0, len));
}
}
return sb.toString();
}
private static String randomText(String pattern) {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(100);
for (int i = 0; i < steps && sb.length() < 10000; i++) {
if (random.nextDouble() < 0.7) { // Add prefix of pattern
int len = random.nextInt(pattern.length()) + 1;
sb.append(pattern.substring(0, len));
} else { // Add literal
int len = random.nextInt(30) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
}
}
return sb.toString();
}
private static void test(String pattern, String text) {
try {
assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
} catch (AssertionError e) {
System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
}
}
}