C: finding the n-gram that is the most frequent one among all the words
I ran into the following programming interview question:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. The word "pilot" has three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example

Input
text: "aaaab a0a baaab c"
ngramLength: 3

Output
aaa

Explanation

For the input above, the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I only had one hour to solve this problem and chose to use C: is implementing a hash table to count the frequency of the N-grams a good idea, given the amount of time available? There is no hash table implementation in the C standard library.

If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. These implementations reduce the time you have left for solving the problem itself.

Is this the fastest option? Thank you.

If implementation efficiency is important and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-gram they are part of, and then loop over that sorted array and figure out the counts.
This should perform fast enough, and there is no need to write any fancy data structures. Alternatively, you can convert each trigram to a RADIX-50 code. In RADIX-50, the packed value of a trigram fits in a 16-bit unsigned int, so afterwards you can use the radix-encoded trigram as an index into an array. Your code would then look like this:
uint32_t counters[1 << 16]; // 64K counters; 32-bit, since one trigram can occur more than 65535 times in 250,000 characters
bzero(counters, sizeof(counters));
for (const char *p = txt; p[0] && p[1] && p[2]; p++)
    counters[radix50(p)]++;
Sorry for posting Python, but this is what I would do. You might get some ideas for the algorithm from it. Note that this program solves an order of magnitude more words:
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)

n = 3
ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
So the basic approach for solving this problem is:

find all the n-grams in the string
map all duplicates into a new structure that has the n-gram and the number of times it occurs
My C++ solution can be found here:

Given:
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, for step 1, this is the code:
const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}
Just for fun, I wrote a SQL version (SQL Server 2012):
Required output:
------------------------------
aaa
3
If you don't have to use C: I wrote this Python script in about 10 minutes. It processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on the screen).

The text used for the test is Ulysses by James Joyce; you can find it free online. Word separators here are both space and the newline \n.
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]

ngrams = {}

for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

ngrams_by_freq = {}

for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())

for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You're right in thinking that a hash table is a good solution to the problem.

However, since you have limited time to write a solution, I'd suggest using open addressing instead of linked lists. The implementation may be simpler: if you reach a collision, you just walk farther along the table.

Also, make sure you allocate enough memory for your hash table: something about twice the expected number of n-grams should be fine.
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for str in str_list:
    start = 0
    end = ngram_len
    len_word = len(str)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(str[start:end]):
                str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
            else:
                str_hash[str[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print "\"%s\" with a frequency of %s" % (ngram[0], ngram[1])
Is this where the actual coding interview ended? Are you sure a binary tree (e.g. an AVL tree) couldn't do the job?

Do you only need 3-grams at most? There are (26+26+10)^3 = 238328 possible 3-grams over alphanumeric characters only, so a straight-up LUT looks viable.

I would preallocate the required number of buckets upfront in a single array (which is possible because you have an upper bound on the text length) and store only pointers to them in the hash table. Retrieval from the hash table can be sped up with a move-to-front / insert-at-back heuristic, and the array can be sorted at the end. Using a tree is slower in practice. Think about it: how many 3-grams are there in a text of 1000 characters?

RADIX-50 is case-insensitive and covers a-z, 0-9, space, dollar, dot and an undefined code: enough to count the trigrams of the text string. But then you no longer honor the ngramLength parameter; this depends heavily on the fact that n = 3. A clever solution for trigrams, though, starting from radix50, setting for each c in the counters…