C#: Most efficient way to find a pattern in a byte array (tags: C#, .NET, .NET 4.5)


I have the following code:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

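I need to search bytes for the first occurrence (if any) of a specific byte sequence: 0xff, 0xd8. (The goal is to find images embedded in the file.)
So if, for example, bytes[6501] contains 0xff and bytes[6502] contains 0xd8, that is a match, and I need either the index of that position (6501) or a new array that is a copy of bytes without the elements below index 6501.
My current solution is a loop: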
 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2)))
    ...


But it is quite slow on larger files.
Is there a more efficient way to handle this?
How about something simple:
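byte[] pattern = new byte[] { 1, 2, 3, 4, 5 };
// The last index at which a full match can start is bytes.Length - pattern.Length.
for (int index = 0, end = bytes.Length - pattern.Length; index <= end; index++)
{
    // Assume a match, then stop comparing at the first mismatching byte.
    bool found = true;
    for (int j = 0; j < pattern.Length && found; j++)
    {
        found = bytes[index + j] == pattern[j];
    }
    if (found)
        return index;
}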


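Note that I haven't written C# in a long time, so please excuse any syntax errors and treat it as pseudocode. (Updated so that it no longer throws index errors.)
You want to walk the array with a plain for loop. The reason your code is slow is simple: every call to Skip(index) creates a new enumerator and advances it index times before it yields anything, so the loop as a whole does quadratic work.
Decompiling Skip shows why: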
public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.SkipIterator<TSource>(source, count);
}

private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    while (count > 0 && enumerator.MoveNext())
      --count;
    if (count <= 0)
    {
      while (enumerator.MoveNext())
        yield return enumerator.Current;
    }
  }
}
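
To make the cost visible, here is a minimal sketch that counts how many elements the question's Skip/Take/SequenceEqual loop pulls from its source. The names SkipCostDemo, Counted and ElementsPulled are made up for this illustration, and the source is wrapped in a plain iterator so that it behaves like the decompiled SkipIterator above (newer runtimes may special-case arrays, which the decompiled .NET 4.5 code does not show).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipCostDemo
{
    // Hypothetical helper: yields the bytes of data one by one and counts
    // how many elements the consumer actually pulled from it.
    private static IEnumerable<byte> Counted(byte[] data, long[] counter)
    {
        foreach (var b in data)
        {
            counter[0]++;
            yield return b;
        }
    }

    // Runs the question's loop over n zero bytes (so the pattern never matches)
    // and returns the total number of elements pulled from the source.
    public static long ElementsPulled(int n)
    {
        var data = new byte[n];
        var pattern = new byte[] { 0xff, 0xd8 };
        var counter = new long[1];

        for (var index = 0; index < n; index++)
        {
            // Each pass builds a fresh iterator that starts again at element 0.
            if (pattern.SequenceEqual(Counted(data, counter).Skip(index).Take(2)))
                break;
        }
        return counter[0];
    }
}
```

Because pass number index has to step over index elements first, the total grows roughly with the square of the input length, while an indexed for loop touches each byte a constant number of times.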

Is there any drawback to a simple linear search?

It returns the start index if found, otherwise -1:
private const byte First = 0x0ff;
private const byte Second = 0x0d8;

private static int FindImageStart(IList<byte> bytes) {
    for (var index = 0; index < bytes.Count - 1; index++) {
        if (bytes[index] == First && bytes[index + 1] == Second) {
            return index;
        }
    }
    return -1;
}
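
The question also asks for a copy of the array that starts at the match. A minimal sketch of that step, assuming the index comes from a search such as the one above; ImageSlice, CopyFrom and ViewFrom are hypothetical names introduced only for this example:

```csharp
using System;

public static class ImageSlice
{
    // Returns a copy of bytes starting at startIndex, or an empty array
    // when startIndex is negative (the "not found" convention, -1).
    public static byte[] CopyFrom(byte[] bytes, int startIndex)
    {
        if (startIndex < 0) return new byte[0];
        var result = new byte[bytes.Length - startIndex];
        Array.Copy(bytes, startIndex, result, 0, result.Length);
        return result;
    }

    // Alternative without copying: a view over the same underlying array.
    // The caller is expected to pass a valid, non-negative index here.
    public static ArraySegment<byte> ViewFrom(byte[] bytes, int startIndex)
    {
        return new ArraySegment<byte>(bytes, startIndex, bytes.Length - startIndex);
    }
}
```

For files of a few megabytes the copy is cheap; the ArraySegment variant only matters if the extra allocation is a concern.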

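If this is time-critical code, I have found that the C# compilers (both Mono's implementation and Microsoft's) have special logic for optimizing simple scan loops.
Based on profiling experience, I would implement the sequence search with a hard-coded search for the first element, like this: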
/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence, or -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }

  // end of array reached without match
  return -1;
}
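
As a variation on the same idea, the first-byte scan can be handed to Array.IndexOf and only the candidates it reports are compared in full. This is a sketch rather than the answer's own code; SequenceSearch and Find are hypothetical names, and whether it is actually faster than the hand-written loop would have to be measured.

```csharp
using System;

public static class SequenceSearch
{
    // Returns the index of the next occurrence of sequence in array at or
    // after start, or -1 if there is none.
    public static int Find(byte[] array, int start, byte[] sequence)
    {
        if (sequence.Length == 0)
            throw new ArgumentException("sequence must not be empty", "sequence");

        int end = array.Length - sequence.Length; // last index where a match can start
        while (start <= end)
        {
            // Look for the first byte only, restricted to the range [start, end].
            int candidate = Array.IndexOf(array, sequence[0], start, end - start + 1);
            if (candidate < 0) return -1;

            // Compare the rest of the sequence at the candidate position.
            int offset = 1;
            while (offset < sequence.Length && array[candidate + offset] == sequence[offset])
                offset++;
            if (offset == sequence.Length) return candidate;

            start = candidate + 1;
        }
        return -1;
    }
}
```

Calling Find again with start set to the previous result plus one walks through all occurrences.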

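IndexOutOfRangeException :) You probably need a tighter bound on index to avoid the IndexOutOfRangeException.
That doesn't work for me. I sometimes need to check long sequences (10 bytes or more) and multiple sequences, so my if would stretch all the way to the moon. (Not to mention that I already tried it and it made no noticeable speed difference.)
Updated to fix the IndexOutOfRangeException and to support patterns of arbitrary length. Please adapt it to your needs.
The inner for, && !found: it exits on the first match. Right?
One small thing: why create a new byte[] on every iteration of the loop?
I'm not. I actually create it before the loop, and inside the loop I only use a variable that refers to it. I just didn't want to make the code sample too complicated.
You could try implementing a … . This is a C# implementation for strings that can serve as a guide.
How much RAM do you use when processing large files? Have you considered processing only chunks of limited size?
@Dman: Not much. Those files are only a few megabytes (typically 2-10 MB), so they don't take up much RAM.
The upper bound of the for loop should be bytes.Count - 1.
Warning: this algorithm gives a false negative when the array and the sequence both have length 1. It also produces a false positive when an offset is used and it tries to match the last byte of the sequence array.
Thanks for letting me know. I've updated the code, and hopefully it now handles both cases correctly!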
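
On the chunking suggestion from the comments: a rough sketch of searching a stream in fixed-size chunks, keeping the last pattern.Length - 1 bytes of each chunk so a match that straddles a chunk boundary is not missed. ChunkedSearch, its IndexOf method and the default chunkSize are assumptions made for this example, not something from the thread, and chunkSize is assumed to be positive.

```csharp
using System;
using System.IO;

public static class ChunkedSearch
{
    // Returns the absolute offset of the first occurrence of pattern in the
    // stream (counted from the current position), or -1 if there is none.
    public static long IndexOf(Stream stream, byte[] pattern, int chunkSize = 64 * 1024)
    {
        if (pattern.Length == 0) return 0;

        int overlap = pattern.Length - 1;
        var buffer = new byte[chunkSize + overlap];
        int carried = 0;       // bytes kept from the previous chunk at the buffer start
        long bufferStart = 0;  // stream offset that buffer[0] corresponds to

        while (true)
        {
            int read = stream.Read(buffer, carried, chunkSize);
            if (read == 0) return -1; // end of stream, nothing left to match
            int valid = carried + read;

            // Check every position where a full pattern still fits.
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }

            // Keep the tail that could still be the beginning of a match.
            int keep = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
    }
}
```

For files of 2-10 MB, as described above, loading everything with ToArray() is likely fine; a sketch like this only becomes interesting for inputs that should not be held in memory at once.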

public int FindSequence(byte[] source, byte[] seq)
{
    var start = -1;
    for (var i = 0; i < source.Length - seq.Length + 1 && start == -1; i++)
    {
        var j = 0;
        for (; j < seq.Length && source[i+j] == seq[j]; j++) {}
        if (j == seq.Length) start = i;
    }
    return start;
}
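
For comparison, on runtimes where the Span types are available (they are not part of .NET 4.5 itself; this assumes .NET Core 2.1 or later, or the System.Memory package), the same search can be expressed with the built-in MemoryExtensions.IndexOf. SpanSearch is a hypothetical wrapper name used only here:

```csharp
using System;

public static class SpanSearch
{
    // Returns the index of the first occurrence of seq in source, or -1.
    public static int IndexOf(byte[] source, byte[] seq)
    {
        // Arrays convert implicitly to ReadOnlySpan<byte>; no data is copied.
        ReadOnlySpan<byte> haystack = source;
        ReadOnlySpan<byte> needle = seq;
        return haystack.IndexOf(needle);
    }
}
```

This delegates the scanning strategy to the runtime, so it is worth profiling against the hand-written loops above before picking one.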