C++: Adapting a Boyer-Moore implementation

c++ c algorithm

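I am trying to adapt the Boyer-Moore C(++) implementation from Wikipedia to get all the matches of a pattern in a string. As it stands, the Wikipedia implementation returns only the first match. The main code looks like this: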

char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i+1);
        }

        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}

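I tried modifying the block after if (j < 0) to add the index to an array/vector and let the outer loop continue, but it doesn't seem to work. When I test the modified code I still get only a single match. Perhaps this implementation wasn't designed to return all matches, and it needs more than a few quick changes to do so? I don't understand the algorithm itself very well, so I don't know how to make it work. If anyone can point me in the right direction I would be grateful.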

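Note: the functions make_delta1 and make_delta2 are defined earlier in the source (see the Wikipedia page), and the max() call is actually a macro that is also defined earlier in the source.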

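The Boyer-Moore algorithm exploits the fact that when you search for, say, "HELLO WORLD" within a longer string, the letter you find at a given position restricts what can be found around that position if there is to be a match. It is a bit like a game of Battleship: if you find open sea four cells from the border, you don't need to test the four remaining cells in case a five-cell carrier is hiding there; it can't be.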

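For example, if you find a 'D' in the eleventh position, it might be the last letter of HELLO WORLD; but if you find a 'Q', which occurs nowhere in HELLO WORLD, the searched-for string cannot lie anywhere within the first eleven characters, and you can skip searching there altogether. An 'L', on the other hand, might mean that HELLO WORLD is there, aligned so that its third, fourth or tenth letter (the three L's) falls on position 11, that is, starting 2, 3 or 9 characters earlier.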

During the search, these possibilities are tracked using the two delta arrays.
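
To make the role of the first delta array more concrete, here is a minimal sketch of a bad-character table. The name build_bad_char_table and the 256-entry ALPHABET_LEN are assumptions made for illustration, and this Horspool-style table is not necessarily identical to the make_delta1 that the question takes from Wikipedia.

```c
#include <stddef.h>
#include <stdint.h>

#define ALPHABET_LEN 256  /* assumption: one table slot per byte value */

/* Hypothetical helper: fill delta1 so that delta1[c] is how far the search
 * window may slide when the text byte under the LAST pattern position is c.
 * Bytes that do not occur in the pattern (ignoring its final position)
 * allow a slide of the full pattern length. */
void build_bad_char_table(int delta1[ALPHABET_LEN],
                          const uint8_t *pat, size_t patlen)
{
    /* Default: the byte is not in the pattern, so skip the whole window. */
    for (size_t c = 0; c < ALPHABET_LEN; c++)
        delta1[c] = (int)patlen;

    /* For every position except the last, record the distance from that
     * position to the end of the pattern; later (rightmost) occurrences
     * overwrite earlier ones, giving the smallest safe slide. */
    for (size_t k = 0; k + 1 < patlen; k++)
        delta1[pat[k]] = (int)(patlen - 1 - k);
}
```

For the pattern HELLO WORLD this gives a slide of 11 for 'Q' (skip the whole window, as in the example above) and a slide of only 1 for 'L', because an L sits right next to the final D.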

So when you find a match, you should do something like this:

if (j < 0)
{
    // Found a match spanning positions i+1 .. i+patlen.
    // Store it in an array (or vector, or whatever is needed); check that
    // we don't overflow it, keeping one slot free for the terminating 0.
    if (index_counter + 1 >= index_size)
    {
        free(delta2);
        index[index_counter] = 0;
        return index_counter;
    }
    index[index_counter++] = i+1;

    // Reinitialize j to restart the comparison
    j = patlen-1;

    // Move i to the last position of the window that starts one place
    // after this match. The +1 is what shifts the window, so overlapping
    // matches are still found.
    i += patlen + 1;

    // Do not free delta2 here: the search goes on.
    // free(delta2);

    // Continue the loop without altering i again
    continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;

I then modified the if (j < 0) block to simply print what it found:

    if (j < 0) {
            printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
            //free(delta2);
            // return (string + i+1);
            i += patlen + 1;
            j = patlen - 1;
            continue;
    }

As expected:

Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along

If the string contains two overlapping sequences, both are found:

char *s = "This is an andean andeandean andean trouble";
char *p = "andean";

Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble
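
For readers who want something they can compile in isolation, the following is a minimal sketch that collects every match, overlapping ones included. Note that it swaps in Boyer-Moore-Horspool, a simplified variant that uses only a bad-character table, rather than the two-table Wikipedia code discussed here; the name find_all and its parameters are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALPHABET_LEN 256  /* assumption: one table slot per byte value */

/* Hypothetical helper (Horspool variant, not the Wikipedia code):
 * stores the offsets of up to max_matches occurrences of pat in text,
 * overlapping ones included, and returns how many were stored. */
size_t find_all(const uint8_t *text, size_t textlen,
                const uint8_t *pat, size_t patlen,
                size_t *matches, size_t max_matches)
{
    size_t shift[ALPHABET_LEN];
    size_t count = 0;

    if (patlen == 0 || patlen > textlen)
        return 0;

    /* Bad-character table: distance from the rightmost occurrence of each
     * byte (final pattern position excluded) to the end of the pattern. */
    for (size_t c = 0; c < ALPHABET_LEN; c++)
        shift[c] = patlen;
    for (size_t k = 0; k + 1 < patlen; k++)
        shift[pat[k]] = patlen - 1 - k;

    size_t pos = 0;  /* start of the current window */
    while (pos + patlen <= textlen && count < max_matches) {
        if (memcmp(text + pos, pat, patlen) == 0)
            matches[count++] = pos;
        /* Slide by the table entry for the byte under the window's last
         * position; this is always at least 1, match or no match. */
        pos += shift[text[pos + patlen - 1]];
    }
    return count;
}
```

Called on the "andean" example above with room for at least four results, this sketch should report the same offsets as the modified function: 11, 18, 22 and 29.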

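To avoid overlapping matches, the quickest way is simply not to store the overlaps. Skipping them could be done inside the search itself, but that would mean reinitializing the first delta vector and updating the string pointer; we would also need to keep a second index, i2, to prevent the saved indexes from becoming non-monotonic. It isn't worth it. Better: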

    if (j < 0) {
        // We have found a patlen match at i+1
        // Is it an overlap?
        if (index && (indexes[index] + patlen > i+1))
        {
            // Yes, it is. So we don't store it.


            // We could store the last of several overlaps
            // It's not exactly trivial, though:
            // searching 'anana' in 'Bananananana'
            // finds FOUR matches, and the fourth is NOT overlapped
            // with the first. So in case of overlap, if we want to keep
            // the LAST of the bunch, we must save info somewhere else,
            // say last_conflicting_overlap, and check twice.
            // Then again, the third match (which is the last to overlap
            // with the first) would overlap with the fourth.

            // So the "return as many non overlapping matches as possible"
            // is actually accomplished by doing NOTHING in this branch of the IF.
        }
        else
        {
            // Not an overlap, so store it.
            indexes[++index] = i+1;
            if (index == max_indexes) // Too many matches already found?
                break; // Stop searching and return found so far
        }
        // Adapt i and j to keep searching
        i += patlen + 1;
        j = patlen - 1;
        continue;
    }
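
The same "keep a match only if it does not overlap the last stored one" rule can also be applied afterwards, to a list of offsets that has already been collected. The sketch below assumes the offsets are sorted in ascending order; the name drop_overlaps is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: keep only the matches that do not overlap the
 * previously kept one. offsets[] must be sorted ascending; the kept
 * offsets are compacted to the front of the array and their count
 * is returned. */
size_t drop_overlaps(size_t *offsets, size_t count, size_t patlen)
{
    size_t kept = 0;
    for (size_t k = 0; k < count; k++) {
        /* A match is kept if it is the first one, or if the previously
         * kept match ends at or before the position where it starts. */
        if (kept == 0 || offsets[kept - 1] + patlen <= offsets[k])
            offsets[kept++] = offsets[k];
    }
    return kept;
}
```

With the 'anana' in 'Bananananana' case from the comments, the four matches start at offsets 1, 3, 5 and 7, and this filter keeps 1 and 7.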
