Algorithm: Count the occurrences of a word in an incoming character stream

algorithm data-structures

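I was asked this question in an interview. Although I am good at DS & Algo, I could not solve this one. It is an interesting problem anyway, so I am posting it here.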

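Problem: You have an incoming stream of characters and need to count the occurrences of a word. You have only one API for reading from the stream, stream.next_char(), which returns "\0" when there are no more characters.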

int count_occurrences(Stream stream, String word) {
// you have only one function provided from Stream class that you can use to 
// read one char at a time, no length/size etc.
// stream.next_char() - return "\0" if end
}

Input: "aabckjhabcc", word: "abc"

Output: 2

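The simplest solution is to use a buffer that holds at most word.length() characters and to compare it with the word after every character read.

The time complexity is O(N*M) and the memory is O(M), where N is the length of the stream and M is the length of the word.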
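A minimal sketch of the buffer approach in C++. `StringStream` is a hypothetical stand-in for the interviewer's `Stream` class (only `next_char()` comes from the question), and `count_with_buffer` is a name chosen here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for the Stream class from the question:
// next_char() yields one character at a time and '\0' at the end.
struct StringStream {
    std::string data;
    std::size_t pos;
    explicit StringStream(std::string s) : data(std::move(s)), pos(0) {}
    char next_char() { return pos < data.size() ? data[pos++] : '\0'; }
};

// Keeps the last word.size() characters and compares them with the word
// after every read: O(N*M) time, O(M) memory.
int count_with_buffer(StringStream& stream, const std::string& word) {
    assert(!word.empty());
    std::deque<char> window;
    int count = 0;
    for (char ch = stream.next_char(); ch != '\0'; ch = stream.next_char()) {
        window.push_back(ch);
        if (window.size() > word.size()) window.pop_front();  // drop the oldest character
        if (window.size() == word.size() &&
            std::equal(window.begin(), window.end(), word.begin())) {
            ++count;
        }
    }
    return count;
}
```

Called on a stream holding "aabckjhabcc" with the word "abc", this returns 2, matching the example in the question. Overlapping occurrences are counted as well.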

Maybe something like this:

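import java.util.ArrayList;
import java.util.List;

int count_occurrences(Stream stream, String word) {
    // you have only one function provided from Stream class that you can use to 
    // read one char at a time, no length/size etc.
    // stream.next_char() - return "\0" if end

    // Each entry is the number of characters of word already matched
    // by one partial occurrence that is still alive.
    List<Integer> positions = new ArrayList<>();

    int counter = 0;
    while (true) {
        char ch = stream.next_char();
        if (ch == '\0') return counter;

        // The current character may start a new occurrence.
        if (ch == word.charAt(0)) {
            positions.add(0);
        }

        int i = 0;
        while (i < positions.size()) {
            int pos = positions.get(i);

            // Mismatch: this partial occurrence is dead.
            if (word.charAt(pos) != ch) {
                positions.remove(i);
                continue;
            }

            pos++;
            // Whole word matched: count it and drop the entry.
            if (pos == word.length()) {
                positions.remove(i);
                counter++;
                continue;
            }

            positions.set(i, pos);
            i++;
        }
    }
}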

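What they were (probably) looking for is either Rabin-Karp or Knuth-Morris-Pratt. Both need only a single pass over the stream and have very little overhead. If the pattern is large, they will be a clear win in terms of speed, since the complexity is O(length of stream).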

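Rabin-Karp relies on a hash that can be updated in O(1) for each new character. If the hash is not very good, or if the stream is very long, hash collisions may give you false positives.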
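A rough sketch of the rolling-hash idea in C++. The base 257, the modulus 1000000007, the `StringStream` stand-in for `Stream` and the name `count_with_rolling_hash` are all assumptions made for illustration. The sketch compares hashes only, so a collision would be counted as a match, which is exactly the false-positive risk mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Stream class from the question.
struct StringStream {
    std::string data;
    std::size_t pos;
    explicit StringStream(std::string s) : data(std::move(s)), pos(0) {}
    char next_char() { return pos < data.size() ? data[pos++] : '\0'; }
};

int count_with_rolling_hash(StringStream& stream, const std::string& word) {
    assert(!word.empty());
    const std::uint64_t kBase = 257;            // assumed base
    const std::uint64_t kMod = 1000000007ULL;   // assumed prime modulus
    const std::size_t m = word.size();

    // target = hash of the word, high = kBase^(m-1) mod kMod.
    std::uint64_t target = 0, high = 1;
    for (std::size_t i = 0; i < m; ++i) {
        target = (target * kBase + static_cast<unsigned char>(word[i])) % kMod;
        if (i + 1 < m) high = (high * kBase) % kMod;
    }

    std::vector<unsigned char> ring(m);  // last m characters, stored circularly
    std::size_t seen = 0;                // number of characters read so far
    std::uint64_t hash = 0;
    int count = 0;
    for (char ch = stream.next_char(); ch != '\0'; ch = stream.next_char()) {
        const unsigned char in = static_cast<unsigned char>(ch);
        const std::size_t slot = seen % m;
        if (seen >= m) {
            // Remove the contribution of the character leaving the window.
            hash = (hash + kMod - (ring[slot] * high) % kMod) % kMod;
        }
        hash = (hash * kBase + in) % kMod;  // O(1) update for the new character
        ring[slot] = in;
        ++seen;
        if (seen >= m && hash == target) ++count;  // hash-only check: collisions possible
    }
    return count;
}
```

Because the last m characters are kept in `ring` anyway, a stricter variant could compare them with the word whenever the hashes agree, at the cost of O(M) extra work per candidate.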

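Knuth-Morris-Pratt relies on precomputing, for each position in the pattern, the length of the longest prefix that is also a suffix. Storing those results needs O(n) memory, where n is the length of the pattern, but that is all.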

Look up string pattern matching on Wikipedia for more details and implementations.

I think this problem is related to matching strings using a finite-state model of computation (a finite automaton).
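One way to make the finite-state idea concrete is to precompute a transition table for the word and then feed the stream through it one character at a time. The sketch below is an illustration only: the table layout (one row of 256 entries per state), the `StringStream` stand-in and the function names are assumptions, not part of the original answer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Stream class from the question.
struct StringStream {
    std::string data;
    std::size_t pos;
    explicit StringStream(std::string s) : data(std::move(s)), pos(0) {}
    char next_char() { return pos < data.size() ? data[pos++] : '\0'; }
};

// next[state][c] = state reached from `state` on character c.
// State k means "the last k characters read equal the first k of word".
std::vector<std::array<int, 256>> build_automaton(const std::string& word) {
    assert(!word.empty());
    const int m = static_cast<int>(word.size());
    std::vector<std::array<int, 256>> next(m + 1);
    for (auto& row : next) row.fill(0);
    next[0][static_cast<unsigned char>(word[0])] = 1;
    int fallback = 0;  // state the automaton would be in after reading word[1..state-1]
    for (int state = 1; state <= m; ++state) {
        next[state] = next[fallback];  // on a mismatch, behave like the fallback state
        if (state < m) {
            const unsigned char c = static_cast<unsigned char>(word[state]);
            next[state][c] = state + 1;    // on a match, advance
            fallback = next[fallback][c];
        }
    }
    return next;
}

int count_with_automaton(StringStream& stream, const std::string& word) {
    const std::vector<std::array<int, 256>> next = build_automaton(word);
    const int accept = static_cast<int>(word.size());
    int state = 0, count = 0;
    for (char ch = stream.next_char(); ch != '\0'; ch = stream.next_char()) {
        state = next[state][static_cast<unsigned char>(ch)];
        if (state == accept) ++count;  // reached the accepting state: one occurrence
    }
    return count;
}
```

The table costs 256 integers per pattern character, but each stream character is then handled with a single lookup.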

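This problem can be solved using the KMP string matching algorithm.

The KMP algorithm looks for matches of a pattern string in a text string, taking into account how much of the pattern's prefix still matches even when we hit a mismatch at some point.

To determine "how much of the prefix can still be matched" when we encounter a mismatch after matching up to index i of the pattern, a failure function is built in advance. (See the code below for building the failure function values.)

For each index i of the pattern, the failure function tells us how much of the pattern's prefix can still be matched even if we encounter a mismatch after index i.

The idea is to find, for every substring of the pattern spanning indices 1 to i, where i ranges from 1 to n, the length of the longest proper prefix that is also a suffix of that substring.

I am using string indices that start from 1.

Thus, the failure function value for the first character of any pattern is 0 (i.e. no character has been matched so far).

For the subsequent characters, for each index i = 2 to n, we look at the length of the longest proper prefix of the substring pattern[1...i] that is also a suffix of pattern[1...i].

Suppose our pattern is "aac". Then the failure function value for index 1 is 0 (nothing matched yet), and for index 2 it is 1 (the longest proper prefix of "aa" that is also a proper suffix of "aa" has length 1).

For the pattern "ababac", the failure function value is 0 for index 1, 0 for index 2, 1 for index 3 (since the 'a' at index 3 is the same as the 'a' at index 1), 2 for index 4 (since "ab" at indices 1 and 2 is the same as "ab" at indices 3 and 4), and 3 for index 5 ("aba" at indices [1...3] is the same as "aba" at indices [3...5]). For index 6 the failure function value is 0.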
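The values worked out above can be checked with a small helper. This is a sketch using ordinary 0-based C++ indexing, so `f[i]` here corresponds to index i + 1 in the 1-based description; the name `failure_function` is chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// f[i] = length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
std::vector<int> failure_function(const std::string& pattern) {
    std::vector<int> f(pattern.size(), 0);
    int t = 0;  // length of the border of the prefix processed so far
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        // Shrink the candidate border until it can be extended by pattern[i].
        while (t > 0 && pattern[t] != pattern[i]) t = f[t - 1];
        if (pattern[t] == pattern[i]) ++t;
        assert(t <= static_cast<int>(i));  // a proper prefix is shorter than the substring
        f[i] = t;
    }
    return f;
}
```

For "ababac" this yields 0, 0, 1, 2, 3, 0, and for "aac" it yields 0, 1, 0, in agreement with the values derived by hand.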

Below is the code (C++) to build the failure function and to match a text (or stream) using it:

/* Assuming that string indices start from 1 for both pattern and text. */
/* Firstly build the failure function. */
int s = 1;
int t = 0;  

/* n denotes the length of the pattern */
int *f = new int[n+1];
f[1] = 0;   

for (s = 1; s < n; s++) {
    while (t > 0 && pattern[t + 1] != pattern[s + 1]) {
        t = f[t];
    }
    if (pattern[t + 1] == pattern[s + 1]) {
        t++;
        f[s + 1] = t;
    }
    else {
        f[s + 1] = 0;           
    }
}

/* Now start reading characters from the stream */
int count = 0;
char current_char = stream.next_char();

/* s denotes the index of pattern upto which we have found match in text */
/* initially its 0 i.e. no character has been matched yet. */
s = 0; 
while (current_char != '\0') {

    /* If we had previously matched a certain number of characters
       and then failed, we set s back to the length of the longest
       prefix that might still be matched.

       (spaces between the characters are just for clarity)
       E.g., if the pattern is              "a  b  a  b  a  a" 
       & its failure returning index is     "0  0  1  2  3  1"

       and we encounter 
       the text like :      "x  y  z  a  b  a  b  a  b  a  a" 
              indices :      1  2  3  4  5  6  7  8  9  10 11

       After matching the substring "a  b  a  b  a", starting at
       index 4 of the text, we are successful up to index 8 but fail
       at index 9: the character at index 9 of the text is 'b', but
       the pattern expects 'a' there. Thus, the part of the pattern
       matched so far has length 5 ( a  b  a  b  a )
                                     1  2  3  4  5
       The failure returning index at index 5 of the pattern is 3,
       which means that the text still matches the pattern up to
       index 3 (a  b  a), not starting from index 4 of the text,
       but starting from index 6 of the text.

       Thus we continue searching as if the match had started at
       index 6 of the text, eventually finding the match from
       index 6 up to index 11.

       */
        while (s > 0 && current_char != pattern[s + 1]) {
            s = f[s];
        }
        if (current_char == pattern[s + 1]) s++; /* We test the next character after the currently
                                                    matched portion of pattern with the current 
                                                    character of text , if it matches, we increase
                                                    the size of our matched portion by 1*/
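        if (s == n) {
            count++;
            s = f[s]; /* Fall back to the longest border so that overlapping
                         matches are found and pattern[n + 1] is never read. */
        }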
        current_char = stream.next_char();
}

printf("Count is %d\n", count);
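The fragment above leaves `pattern`, `n` and `stream` undefined and uses 1-based indices. A self-contained sketch of the same technique with ordinary 0-based indexing might look as follows; `StringStream` is a hypothetical stand-in for the `Stream` class from the question, and the fallback after a full match is an addition made here so that overlapping occurrences are counted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Stream class from the question.
struct StringStream {
    std::string data;
    std::size_t pos;
    explicit StringStream(std::string s) : data(std::move(s)), pos(0) {}
    char next_char() { return pos < data.size() ? data[pos++] : '\0'; }
};

int count_occurrences(StringStream& stream, const std::string& word) {
    assert(!word.empty());
    const int n = static_cast<int>(word.size());

    // f[i] = length of the longest proper prefix of word[0..i]
    // that is also a suffix of word[0..i].
    std::vector<int> f(n, 0);
    for (int i = 1, t = 0; i < n; ++i) {
        while (t > 0 && word[t] != word[i]) t = f[t - 1];
        if (word[t] == word[i]) ++t;
        f[i] = t;
    }

    int count = 0;
    int s = 0;  // number of pattern characters currently matched
    for (char ch = stream.next_char(); ch != '\0'; ch = stream.next_char()) {
        // On a mismatch, fall back to the longest prefix that can still match.
        while (s > 0 && ch != word[s]) s = f[s - 1];
        if (ch == word[s]) ++s;
        if (s == n) {
            ++count;
            s = f[s - 1];  // keep the longest border so overlapping matches count
        }
    }
    return count;
}
```

Each stream character is read exactly once, and the only extra memory is the failure table of size n.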
