C++: Searching for exact string matches in an (arbitrarily large) stream

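I am building a simple multi-client server for string matching. I use sockets and select to handle several clients at the same time. The only job the server does is this: a client connects to the server and sends a needle (less than 10 GB in size) and a haystack (of arbitrary size) as a stream through a network socket. Both the needle and the haystack are arbitrary binary data.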

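The server needs to search the haystack for all occurrences of the needle (as an exact string match) and send the number of needle matches back to the client. The server needs to process clients on the fly and be able to handle any input in a reasonable time (that is, the search algorithm must have linear time complexity).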

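To do this, I obviously need to split the haystack into small parts (possibly smaller than the needle) so that I can process them as they arrive through the network socket. That is, I need a search algorithm that can handle a string that has been split into parts and search in it the same way strstr(...) does.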
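
To illustrate why the parts cannot simply be searched one by one, here is a minimal sketch. The function name `count_per_chunk` is hypothetical and is not part of the question's code; it runs `std::string::find` on each chunk in isolation, so a needle that straddles a chunk boundary goes unnoticed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration only): counts needle
// occurrences inside each chunk separately, ignoring chunk boundaries.
std::size_t count_per_chunk(const std::vector<std::string>& chunks,
                            const std::string& needle) {
    assert(!needle.empty());  // an empty needle is not meaningful here
    std::size_t count = 0;
    for (const std::string& chunk : chunks) {
        // Step by one position after each hit so overlapping matches count too.
        std::size_t pos = chunk.find(needle);
        while (pos != std::string::npos) {
            ++count;
            pos = chunk.find(needle, pos + 1);
        }
    }
    return count;
}
```

With the chunks "dolo" and "rem ipsum" and the needle "lorem", this returns 0, although the concatenated stream "dolorem ipsum" contains the needle once. Some state has to be carried from one chunk to the next.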

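I could not find any standard C or C++ library function, nor a Boost library object, that can process a string part by part. If I am not mistaken, strstr(), string.find() and the algorithm in Boost's searching/knuth_morris_pratt.hpp can only handle a search when the whole haystack sits in one contiguous block of memory. Or is there some trick I am missing that would let me search a string part by part? Do you know of any C/C++ library that can cope with such large needles and haystacks, that is, one that can handle haystack streams or search a haystack part by part?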

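I did not find any useful library by googling, so I was forced to write my own variation of the Knuth-Morris-Pratt algorithm that is able to remember its own state (shown below). However, I do not consider it an optimal solution: in my opinion a well-tuned string searching algorithm would surely perform better, and it would be one less thing to worry about when debugging later.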

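So my question is: is there a more elegant way to search a large haystack stream part by part than writing my own search algorithm? Is there a trick for doing this with the standard C string library? Is there a C/C++ library that specializes in this kind of task?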

Here is (part of) the code of my modified KMP algorithm:

#include <cstdlib>
#include <cstring>
#include <cstdio>

class knuth_morris_pratt {
    const char* const needle;
    const size_t needle_len;
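    // lps[i] is the length of the longest proper prefix of needle[0..i] that
    // is also a suffix of it (the KMP skip table).
    // Note: int entries limit the needle to INT_MAX bytes; a needle of several
    // GB would need a wider element type such as size_t.
    const int* const lps;

    // suffix_len is the length of the longest suffix of the haystack data seen
    // so far that matches a prefix of the needle. It is always shorter than
    // needle_len and is carried over between search_part() calls.
    size_t suffix_len;
    size_t match_count; // number of needles found in the haystack so far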

public:
    inline knuth_morris_pratt(const char* needle, size_t len) : 
            needle(needle), needle_len(len),
            lps( build_lps_array() ), suffix_len(0),
            match_count(len == 0 ? 1 : 0)    {  }
    inline ~knuth_morris_pratt() {  free((void*)lps);   }

    void search_part(const char* haystack_part, size_t hp_len); // processes a given part of the haystack stream
    inline size_t get_match_count() { return match_count; }

private:
    const int* build_lps_array();
};

// Worst case complexity: linear space, linear time

// see: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// see article: KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings
void knuth_morris_pratt::search_part(const char* haystack_part, size_t hp_len) {
    if(needle_len == 0) {
        match_count += hp_len;
        return;
    }

    const char* hs = haystack_part;

    size_t i = 0; // index for txt[]
    size_t j = suffix_len; // index for pat[]
    while (i < hp_len) {
        if (needle[j] == hs[i]) {
            j++;
            i++;
        }

        if (j == needle_len) {
            // a needle found
            match_count++;
            j = lps[j - 1];
        }
        else if (i < hp_len && needle[j] != hs[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }

    suffix_len = j;
}

const int* knuth_morris_pratt::build_lps_array() {
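    // Allocate one int per needle byte (at least one, so that new_lps[0]
    // below is valid even for an empty needle).
    int* const new_lps = (int*)malloc((needle_len ? needle_len : 1) * sizeof(int));
    if (new_lps == NULL) {
        fprintf(stderr, "Unable to allocate memory in knuth_morris_pratt(..)\n");
        abort();
    }

    // length of the previous longest prefix suffix
    size_t len = 0;
    new_lps[0] = 0; // lps[0] is always 0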

    // the loop calculates lps[i] for i = 1 to M-1
    size_t i = 1;
    while (i < needle_len) {
        if (needle[i] == needle[len]) {
            len++;
            new_lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0) {
                len = new_lps[len - 1];

                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                new_lps[i] = 0;
                i++;
            }
        }
    }

    return new_lps;
}


int main() 
{
    const char* needle = "lorem";
    const char* p1 = "sit voluptatem accusantium doloremque laudantium qui dolo";
    const char* p2 = "rem ipsum quia dolor sit amet";
    const char* p3 = "dolorem eum fugiat quo voluptas nulla pariatur?";
    knuth_morris_pratt searcher(needle, strlen(needle));
    searcher.search_part(p1, strlen(p1));
    searcher.search_part(p2, strlen(p2));
    searcher.search_part(p3, strlen(p3));

    printf("%zu\n", searcher.get_match_count());

    return 0;
}
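
For comparison, the same state-carrying idea can be sketched with `std::vector` and `size_t` table entries, so that memory is released automatically and the table is not limited to `int` range. This is only an illustrative sketch under assumptions: the class name `stream_kmp` and its members are hypothetical and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class (an assumption for illustration): a streaming KMP counter.
class stream_kmp {
    std::string needle_;
    // lps_[i]: length of the longest proper prefix of needle_[0..i]
    // that is also a suffix of it.
    std::vector<std::size_t> lps_;
    std::size_t matched_ = 0;  // needle bytes matched so far; survives across feed() calls
    std::size_t count_ = 0;    // matches found so far

public:
    explicit stream_kmp(std::string needle)
        : needle_(std::move(needle)), lps_(needle_.size(), 0) {
        assert(!needle_.empty());
        // Build the prefix table in linear time.
        for (std::size_t i = 1, len = 0; i < needle_.size(); ++i) {
            while (len > 0 && needle_[i] != needle_[len]) len = lps_[len - 1];
            if (needle_[i] == needle_[len]) ++len;
            lps_[i] = len;
        }
    }

    // Process the next part of the haystack; parts may be of any length.
    void feed(const char* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            while (matched_ > 0 && data[i] != needle_[matched_])
                matched_ = lps_[matched_ - 1];
            if (data[i] == needle_[matched_]) ++matched_;
            if (matched_ == needle_.size()) {
                ++count_;                       // full needle seen
                matched_ = lps_[matched_ - 1];  // allow overlapping matches
            }
        }
    }

    std::size_t count() const { return count_; }
};
```

Feeding the three parts from `main()` above into a `stream_kmp` built for "lorem" should give the same count as the original class, since the only state shared between calls is `matched_`, which plays the role of `suffix_len`.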

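You can search each packet on its own and additionally search the overlap between the end of the previous packet and the beginning of the current one: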
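#include <string>

// Returns true if str contains pattern. An empty pattern never matches here.
bool contains(const std::string & str, const std::string & pattern)
{
    bool found(false);

    // "<=" so that a string exactly as long as the pattern is checked as well
    if(!pattern.empty() && (pattern.length() <= str.length()))
    {
        for(size_t i = 0; !found && (i <= str.length()-pattern.length()); ++i)
        {
            // Compare the first character before the more expensive substring check
            if((str[i] == pattern[0]) && (str.substr(i, pattern.length()) == pattern))
            {
                found = true;
            }
        }
    }

    return found;
}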

std::string pattern("something");            // The pattern we want to find
std::string end_of_previous_packet("");      // The first part of overlapping section
std::string beginning_of_current_packet(""); // The second part of overlapping section

std::string overlap;                         // The string to store the overlap at each iteration

bool found(false);

while(!found && !all_data_received())          // stop condition
{
    // Get the current packet
    std::string packet = receive_part();

    // Set the beginning of the current packet
    beginning_of_current_packet = packet.substr(0, pattern.length());

    // Build the overlap
    overlap = end_of_previous_packet + beginning_of_current_packet;

    // If the overlap or the packet contains the pattern, we found a match
    if(contains(overlap, pattern) || contains(packet, pattern))
        found = true;

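    // Set the end of previous packet for the next iteration. Keep the tail of
    // (previous tail + packet): this avoids substr() throwing when a packet is
    // shorter than the pattern and keeps matches that span several packets.
    end_of_previous_packet += packet;
    if(end_of_previous_packet.length() > pattern.length())
        end_of_previous_packet.erase(0, end_of_previous_packet.length() - pattern.length());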
}
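
The loop above only reports whether the pattern occurs at all. If a count of all occurrences is needed, as in the question, the overlap idea can be sketched as follows. The name `overlap_counter` is hypothetical. Note also that the C++ standard does not specify the complexity of `std::string::find`, so this sketch by itself does not guarantee the linear running time the question asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (an assumption for illustration): counts possibly
// overlapping occurrences of a needle in data that arrives chunk by chunk.
struct overlap_counter {
    std::string needle;
    std::string tail;       // last needle.size()-1 bytes of the data seen so far
    std::size_t count = 0;

    explicit overlap_counter(std::string n) : needle(std::move(n)) {
        assert(!needle.empty());
    }

    void feed(const std::string& chunk) {
        // Prepend the carried-over tail so matches crossing the boundary are visible.
        const std::string window = tail + chunk;
        for (std::size_t pos = window.find(needle); pos != std::string::npos;
             pos = window.find(needle, pos + 1)) {
            ++count;
        }
        // Keep only needle.size()-1 bytes: too short to hold a full match,
        // so no occurrence is counted twice on the next call.
        const std::size_t keep = needle.size() - 1;
        tail = window.size() > keep ? window.substr(window.size() - keep) : window;
    }
};
```

Because the carried-over tail is one byte shorter than the needle, every match found in a window ends inside the newly received chunk, which is why each occurrence is counted exactly once. The price is a copy of every chunk and a buffer of roughly the needle's size, which may matter for needles of several GB.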