C#: implementing a recursive hashing algorithm


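Suppose file A contains the following bytes:

2 5 8 0 33 90 1 3 200 201 23 12 55

I have a simple hashing algorithm that stores the sum of the last three consecutive bytes, so: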

2   
5   
8   - = 8+5+2 = 15
0   
33  
90  - = 90+33+0 = 123
1   
3   
200 - = 204
201 
23  
12  - = 236

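So I will be able to represent file A as

15, 123, 204, 236

Suppose I copy that file to a new computer B, make some small modifications, and the bytes of file B are:

255 2 5 8 0 33 90 1 3 200 201 23 12 55 255 255

Note that the difference is one extra byte at the beginning of the file and two extra bytes at the end, but the rest is very similar.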

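So I can run the same algorithm to determine whether some parts of the files are the same. Remember that file A is represented by the hash codes

15, 123, 204, 236

Let's see whether file B gives me some of those hash codes.

So on file B I have to do it for every 3 consecutive bytes: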

int[] sums; // array where we will hold the sum of the last bytes


255 sums[0]  =                  255
2   sums[1]  =   2 + sums[0]  = 257
5   sums[2]  =   5 + sums[1]  = 262
8   sums[3]  =   8 + sums[2]  = 270   hash = sums[3]  - sums[0]  = 15   --> MATCHES FILE A!
0   sums[4]  =   0 + sums[3]  = 270   hash = sums[4]  - sums[1]  = 13
33  sums[5]  =  33 + sums[4]  = 303   hash = sums[5]  - sums[2]  = 41
90  sums[6]  =  90 + sums[5]  = 393   hash = sums[6]  - sums[3]  = 123  --> MATCHES FILE A!
1   sums[7]  =   1 + sums[6]  = 394   hash = sums[7]  - sums[4]  = 124
3   sums[8]  =   3 + sums[7]  = 397   hash = sums[8]  - sums[5]  = 94
200 sums[9]  = 200 + sums[8]  = 597   hash = sums[9]  - sums[6]  = 204  --> MATCHES FILE A!
201 sums[10] = 201 + sums[9]  = 798   hash = sums[10] - sums[7]  = 404
23  sums[11] =  23 + sums[10] = 821   hash = sums[11] - sums[8]  = 424
12  sums[12] =  12 + sums[11] = 833   hash = sums[12] - sums[9]  = 236  --> MATCHES FILE A!
55  sums[13] =  55 + sums[12] = 888   hash = sums[13] - sums[10] = 90
255 sums[14] = 255 + sums[13] = 1143  hash = sums[14] - sums[11] = 322
255 sums[15] = 255 + sums[14] = 1398  hash = sums[15] - sums[12] = 565

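So by looking at that table I know that file B contains the bytes of file A plus some additional bytes, because the hash codes match.

I show this algorithm because it is of order n. In other words, I can compute the hash of the last 3 consecutive bytes without having to iterate over them.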
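
The table above can be reproduced with a small prefix-sum routine. This is only a minimal sketch of the idea described in the question; the class name `WindowSums` and the method `Compute` are made up for illustration and the window size is passed in as `m`.

```csharp
using System;

// Hypothetical helper (not from the original post): windowed byte sums via prefix sums.
public static class WindowSums
{
    // Returns one sum per window of m consecutive bytes.
    // result[i] is the sum of data[i .. i + m - 1].
    public static int[] Compute(byte[] data, int m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));
        if (data.Length < m) return Array.Empty<int>();

        // prefix[i] holds the sum of data[0 .. i - 1]; prefix[0] is 0.
        // long avoids overflow on large files.
        var prefix = new long[data.Length + 1];
        for (int i = 0; i < data.Length; i++)
            prefix[i + 1] = prefix[i] + data[i];

        // Each window sum is a single subtraction, so the whole pass is O(n).
        var sums = new int[data.Length - m + 1];
        for (int i = 0; i < sums.Length; i++)
            sums[i] = (int)(prefix[i + m] - prefix[i]);
        return sums;
    }
}
```

For the bytes of file B and m = 3, the windows starting at offsets 1, 4, 7 and 10 come out as 15, 123, 204 and 236, which are the four hash codes of file A.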

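If I needed a more complex algorithm, such as the MD5 of the last 3 bytes, the order would be higher (n^3 by my estimate), because while iterating over file B I would need an inner for loop to compute the hash of the last 3 bytes.

So my question is: how can I improve the algorithm so that it stays of order n, that is, so that each hash is computed only once? If I use an existing hashing algorithm such as MD5, I will have to put an inner loop inside the algorithm, which will significantly increase its order.

Note that the same thing can be done with multiplication instead of addition, but the counter grows really fast. Maybe I can combine multiplication with addition and subtraction.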
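
One common way to combine multiplication with addition and subtraction is a polynomial rolling hash of the kind used in Rabin-Karp string search. That technique is brought in here for illustration and is not part of the question. The sketch below assumes an arbitrary base of 257 and modulus of 1,000,000,007; the class and method names are hypothetical. Reducing modulo a fixed number keeps the counter from growing, and each step still costs O(1).

```csharp
using System;

// Hypothetical sketch of a Rabin-Karp style rolling hash (not from the original post).
public static class RollingHash
{
    private const long Base = 257;            // assumed base, larger than any byte value
    private const long Mod = 1_000_000_007;   // assumed prime modulus

    // result[i] is the polynomial hash of data[i .. i + m - 1]:
    // data[i] * Base^(m-1) + ... + data[i + m - 1], taken modulo Mod.
    public static long[] Compute(byte[] data, int m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));
        if (data.Length < m) return Array.Empty<long>();

        // Base^(m-1) mod Mod: the weight of the byte that leaves the window.
        long highPow = 1;
        for (int k = 1; k < m; k++) highPow = highPow * Base % Mod;

        var hashes = new long[data.Length - m + 1];

        // Hash of the first window, computed directly.
        long h = 0;
        for (int i = 0; i < m; i++) h = (h * Base + data[i]) % Mod;
        hashes[0] = h;

        // Slide: remove the outgoing byte, shift by Base, add the incoming byte.
        for (int i = m; i < data.Length; i++)
        {
            long outgoing = data[i - m] * highPow % Mod;
            h = (h - outgoing + Mod) % Mod;
            h = (h * Base + data[i]) % Mod;
            hashes[i - m + 1] = h;
        }
        return hashes;
    }
}
```

Unlike a plain sum, this hash depends on byte order, so the windows 2 5 8 and 8 5 2 hash differently. Collisions are still possible, so a match only says the windows are probably equal.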

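Edit: if I search Google for:

recursive hashing functions in n-grams

a lot of information comes up, and I find those algorithms very hard to understand.

I have to implement this algorithm for a project, which is why I am reinventing the wheel. I know there are a lot of algorithms out there.

Another alternative I have been thinking about is to run the same algorithm plus another, stronger one. So on file A I would run the same algorithm on every 3 bytes, plus the MD5 of every 3 bytes. On the second file I would run the second algorithm only when the first one produces a match.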
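
The two-level idea in the last paragraph could look roughly like the following. This is a sketch under assumptions: file A is split into non-overlapping blocks of `m` bytes, the weak hash is the plain byte sum, and MD5 is computed only for windows of file B whose weak hash matches. `TwoTierMatcher` and `FindMatches` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical sketch: cheap rolling sum first, MD5 only on weak matches.
public static class TwoTierMatcher
{
    // Returns (offset in B, block index in A) for every window of B equal to a block of A.
    public static List<(int OffsetInB, int BlockInA)> FindMatches(byte[] a, byte[] b, int m)
    {
        var matches = new List<(int OffsetInB, int BlockInA)>();
        if (m <= 0 || a.Length < m || b.Length < m) return matches;

        using (var md5 = MD5.Create())
        {
            // Index file A: weak sum -> list of (block index, MD5 as Base64).
            var index = new Dictionary<int, List<(int Block, string Strong)>>();
            for (int block = 0; (block + 1) * m <= a.Length; block++)
            {
                int weak = 0;
                for (int k = 0; k < m; k++) weak += a[block * m + k];
                string strong = Convert.ToBase64String(md5.ComputeHash(a, block * m, m));
                if (!index.TryGetValue(weak, out var entries))
                {
                    entries = new List<(int Block, string Strong)>();
                    index[weak] = entries;
                }
                entries.Add((block, strong));
            }

            // Slide over file B with a running sum of the last m bytes.
            int sum = 0;
            for (int i = 0; i < b.Length; i++)
            {
                sum += b[i];
                if (i >= m) sum -= b[i - m];
                if (i < m - 1) continue;                              // window not full yet
                if (!index.TryGetValue(sum, out var candidates)) continue;

                // Weak match: confirm with the strong hash of this window only.
                int start = i - m + 1;
                string strongB = Convert.ToBase64String(md5.ComputeHash(b, start, m));
                foreach (var candidate in candidates)
                    if (candidate.Strong == strongB) matches.Add((start, candidate.Block));
            }
        }
        return matches;
    }
}
```

The scan stays close to linear as long as weak matches are rare, because MD5 runs only on the windows that pass the cheap test.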

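Edit: The more I think about what you mean by "recursive", the more I doubt that the solution I originally presented is what you should implement to do anything useful.

You probably want a hash tree, which is a recursive operation.

To do this, you hash the list, split the list in two, and recurse into the two sublists. Terminate when the list has size 1 or the minimum size you want to hash, since each level of recursion roughly doubles the size of the total hash output.

Pseudocode: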

create-hash-tree(input list, minimum size: default = 1):
  initialize the output list
  hash-sublist(input list, output list, minimum size)
  return output list

hash-sublist(input list, output list, minimum size):
  add sum-based-hash(list) result to output list // easily swap hash styles here
  if size(input list) > minimum size:
    split the list into two halves
    hash-sublist(first half of list, output list, minimum size)
    hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
  initialize the running total to 0

  for each item in the list:
    add the current item to the running total

  return the running total


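I think the running time of the whole algorithm is O(hash(m)) with m = n * (log(n) + 1), where hash(m) is usually linear time.

The storage space is something like O(hash * s) with s = 2n - 1, where each hash is usually of constant size.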

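Note that for C#, I would make the output list a List, but I would make the input list an IEnumerable to save storage space, and use Linq to quickly "split" the list without allocating two sublists.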
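
A possible C# rendering of the hash-tree pseudocode follows. It is a sketch, not the answer's own code: instead of an IEnumerable split with Linq, it passes a start index and a count over an `IReadOnlyList<byte>`, which also avoids allocating sublists. The names `HashTree`, `Create`, `HashRange` and `SumHash` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the recursive hash tree with a sum-based hash.
public static class HashTree
{
    // Returns the hashes in pre-order: whole list, then first half, then second half.
    public static List<long> Create(IReadOnlyList<byte> input, int minimumSize = 1)
    {
        if (minimumSize < 1) throw new ArgumentOutOfRangeException(nameof(minimumSize));
        var output = new List<long>();
        HashRange(input, 0, input.Count, output, minimumSize);
        return output;
    }

    private static void HashRange(IReadOnlyList<byte> input, int start, int count,
                                  List<long> output, int minimumSize)
    {
        // Swap SumHash for another hash function here if needed.
        output.Add(SumHash(input, start, count));

        if (count > minimumSize)
        {
            // "Split" by index range; no sublists are allocated.
            int half = count / 2;
            HashRange(input, start, half, output, minimumSize);
            HashRange(input, start + half, count - half, output, minimumSize);
        }
    }

    private static long SumHash(IReadOnlyList<byte> input, int start, int count)
    {
        long total = 0;
        for (int i = start; i < start + count; i++) total += input[i];
        return total;
    }
}
```

For the input 1 2 3 4 with the default minimum size this yields 10, 3, 1, 2, 7, 3, 4, that is 2n - 1 = 7 hashes. With a minimum size of 2 it yields 10, 3, 7.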

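Original: I think you can get an execution time of O(n + m), where n is the size of the list, m is the size of the running count, and m < n (otherwise all the sums will be equal).

Double-ended queue

Memory consumption will be the stack size plus size m for temporary storage.

To do this, use a double-ended queue and a running total. Add each newly encountered value to the queue and to the running total. Once the queue grows past size m, pop from the front of the queue and subtract the popped value from the running total.

Here is some pseudocode: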
initialize the running total to 0

for each item in the list:
  add the current item to the running total
  push the current value onto the end of the deque
  if deque.length > m:
    pop off the front of the deque
    subtract the popped value from the running total
  assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
  add the item in the list at the current index to the running total
  pop off the front of the deque
  subtract the popped value from the running total
  assign the running total to the current sum slot in the list
  increment the index


This is not recursive, it is iterative.

A run of this algorithm looks like this (for m = 3):
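
The queue-based pseudocode could be written in C# along these lines. This is a sketch with hypothetical names (`QueueWindowSums`, `Compute`). Because values are only pushed at the back and popped from the front, a plain `Queue<int>` is assumed to be enough in place of a full double-ended queue. The second loop wraps around, so the first m - 1 slots end up holding sums that include items from the end of the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the queue-based running sum with wrap-around.
public static class QueueWindowSums
{
    // sums[i] is the sum of the m values ending at index i, wrapping to the
    // end of the list for the first m - 1 positions.
    public static int[] Compute(int[] values, int m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));

        var sums = new int[values.Length];
        var window = new Queue<int>();
        int total = 0;

        // First pass: build the running total, dropping values older than m slots.
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
            window.Enqueue(values[i]);
            if (window.Count > m) total -= window.Dequeue();
            sums[i] = total;
        }

        // Second pass: the queue now holds the last m values. Re-add the first
        // values while draining the queue to overwrite the partial sums.
        int index = 0;
        while (window.Count > 0)
        {
            total += values[index];
            total -= window.Dequeue();
            sums[index] = total;
            index++;
        }
        return sums;
    }
}
```

For the values 2 5 8 0 33 and m = 3, the first pass produces 2, 7, 15, 13, 41 and the second pass rewrites the first three slots, giving 35, 40, 15, 13, 41.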
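With indexes

You can start with the sum of the last m values and use an offset of the index instead of popping from a queue, for example array[i - m], which removes both the queue and the need to reassign any slots.

This does not reduce the execution time, since you still need two loops: one to build up the running count and another to fill in all the values. It does reduce memory usage to stack space only (effectively O(1)).

Here is some pseudocode: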
initialize the running total to 0

for the last m items in the list:
  add those items to the running total

for each item in the list:
  add the current item to the running total
  subtract the value of the item m slots earlier from the running total
  assign the running total to the current sum slot in the list

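The "m slots earlier" part is the tricky bit. You can split it into two loops:

one whose index starts at the end of the list minus m, plus i
one whose index is i minus m

Or, when i - m < 0, you can use modulo arithmetic to "wrap" the value. In C#, the % operator returns a negative result for a negative left operand, so add n before taking the remainder:
int valueToSubtract = array[(i - m + n) % n];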
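
Put together, the index-based variant might look like this in C#. It is a sketch with hypothetical names (`IndexedWindowSums`, `Compute`), and it assumes 0 < m <= n so that adding n once is enough to keep the wrapped index non-negative.

```csharp
using System;

// Hypothetical sketch of the index-based running sum with modulo wrap-around.
public static class IndexedWindowSums
{
    // sums[i] is the sum of the m values ending at index i, wrapping around
    // to the end of the array for the first m - 1 positions.
    public static int[] Compute(int[] values, int m)
    {
        int n = values.Length;
        if (m <= 0 || m > n) throw new ArgumentOutOfRangeException(nameof(m));

        // Seed the running total with the last m values.
        int total = 0;
        for (int i = n - m; i < n; i++) total += values[i];

        var sums = new int[n];
        for (int i = 0; i < n; i++)
        {
            total += values[i];
            // i - m is negative for i < m; adding n keeps the index in range.
            total -= values[(i - m + n) % n];
            sums[i] = total;
        }
        return sums;
    }
}
```

For the values 2 5 8 0 33 and m = 3 this returns 35, 40, 15, 13, 41, the same result as the queue-based version, without the extra queue.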
