C: the n-gram that is the most frequent one among all the words


I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. The word "pilot" has three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Please note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram
Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds
Example

Input
text: "aaaab a0a baaab c"
ngramLength: 3

Output
aaa

Explanation

For the input above, the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I had only one hour to solve this problem and chose to do it in C: is implementing a hash table to count the n-gram frequencies a good idea given the amount of time, since the C standard library has no hash table implementation?

If so, I was thinking of implementing the hash table with separate chaining using ordered linked lists. Writing these implementations cuts into the time you have to solve the problem.

Is this the fastest option?

Thank you.

If implementation efficiency matters and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-grams they point into, and then loop over that sorted array and figure out the counts.

This should execute fast enough, and there is no need to write any fancy data structures.
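
A minimal sketch of that idea, assuming a fixed n-gram length of 3 for brevity; the names here (cmp_ngram, starts) are illustrative, not from the answer:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 3 /* hypothetical fixed n-gram length for this sketch */

/* compare two n-grams through the pointers marking their starts */
static int cmp_ngram(const void *a, const void *b)
{
    return strncmp(*(const char *const *)a, *(const char *const *)b, N);
}

int main(void)
{
    const char *text = "aaaab a0a baaab c";
    size_t len = strlen(text);

    /* collect a pointer to every position starting an in-word n-gram */
    const char **starts = malloc(len * sizeof *starts);
    size_t count = 0;
    for (size_t i = 0; i + N <= len; i++) {
        int in_word = 1;
        for (size_t j = 0; j < N; j++)
            if (!isalnum((unsigned char)text[i + j]))
                in_word = 0;
        if (in_word)
            starts[count++] = text + i;
    }

    qsort(starts, count, sizeof *starts, cmp_ngram);

    /* equal n-grams are now adjacent: find the longest run; the sorted
       order means the first maximal run is the lexicographically
       smallest n-gram among any ties */
    const char *best = NULL;
    size_t best_run = 0, run = 0;
    for (size_t i = 0; i < count; i++) {
        run = (i > 0 && cmp_ngram(&starts[i - 1], &starts[i]) == 0) ? run + 1 : 1;
        if (run > best_run) {
            best_run = run;
            best = starts[i];
        }
    }
    if (best)
        printf("%.*s\n", N, best);
    free(starts);
    return 0;
}

Sorting pointers instead of copying substrings keeps the memory cost at one pointer per position, which is comfortably within bounds for a 250,000-character text.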

You can convert each trigram into a RADIX-50 code. See:

In RADIX-50, the encoded value of a trigram fits into a 16-bit unsigned int.

After that, you can use the RADIX-50-encoded trigram as an index into an array.

So your code would look like:

#include <stdint.h>
#include <strings.h>

/* one counter per possible RADIX-50 code; uint32_t rather than uint16_t,
   because a single trigram can occur more than 65,535 times in a
   250,000-character text */
uint32_t counters[1 << 16]; // 64K counters

bzero(counters, sizeof(counters));

for (const char *p = txt; p[2] != 0; p++)
    counters[radix50(p)]++;
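
The radix50() encoder is not spelled out in the answer; a minimal sketch of it, assuming the classic DEC 40-character set (space, A-Z, $, ., one unused code, 0-9) and case-insensitive input:

#include <ctype.h>
#include <stdint.h>

/* map one character to its RADIX-50 code (0..39); anything outside the
   set falls back to 0, the code for space */
static unsigned r50_char(char c)
{
    c = (char)toupper((unsigned char)c);
    if (c >= 'A' && c <= 'Z') return 1 + (c - 'A');
    if (c == '$') return 27;
    if (c == '.') return 28;
    if (c == '%') return 29; /* the "undefined" slot */
    if (c >= '0' && c <= '9') return 30 + (c - '0');
    return 0;
}

/* pack a trigram into one value: 40^3 = 64,000 codes fit in 16 bits */
static uint16_t radix50(const char *p)
{
    return (uint16_t)(r50_char(p[0]) * 1600 + r50_char(p[1]) * 40 + r50_char(p[2]));
}

Note that this scheme is inherently tied to n = 3 and a case-insensitive alphabet, as pointed out in the comments below.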

Sorry for posting Python, but this is what I would do; you may get some ideas for the algorithm. Note that this program crunches an order of magnitude more words than required.

from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)
n = 3

ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all the logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])

ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []

for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]
# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]

So the basic way to approach this problem would be:

  • find all the n-grams in the string
  • map all duplicate entries into a new structure that holds the n-gram and the number of times it occurs

You can find my C++ solution here:

    Given:

    const unsigned int MAX_STR_LEN = 250000;
    const unsigned short NGRAM = 3;
    const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
    //we will need a maximum of "the length of our string" - "the length of our n-gram"
    //places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
    char ngrams[NGRAMS][NGRAM+1] = { 0 };
    
    Then, for the first step, this is the code:

    const char *ptr = str;
    int idx = 0;
    //notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
    while (notTerminated(ptr)) { 
        //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
        if (noSpace(ptr)) {
            //safely copy our current n-gram over to the ngrams array
            //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
            //are valid letters
            for (int i=0; i<NGRAM; i++) {
                ngrams[idx][i] = ptr[i];
            }
            ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
            idx++;
        }
        ptr++;
    }
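
The second step (mapping duplicates to a structure with counts) is only shown in the linked solution; here is a minimal sketch of it in the same style, assuming the ngrams array and idx filled in by the loop above (cmpNgrams is an illustrative name; needs <stdio.h>, <stdlib.h> and <string.h>):

    //comparator for qsort: each element is one zero-terminated row of ngrams
    int cmpNgrams(const void *a, const void *b) {
        return strcmp((const char *)a, (const char *)b);
    }

    //sort so duplicates become adjacent, then count runs in one pass;
    //because the rows are sorted, the first maximal run is also the
    //lexicographically smallest n-gram among ties
    qsort(ngrams, idx, sizeof ngrams[0], cmpNgrams);

    const char *best = NULL;
    int bestCount = 0, runLen = 0;
    for (int i = 0; i < idx; i++) {
        runLen = (i > 0 && strcmp(ngrams[i - 1], ngrams[i]) == 0) ? runLen + 1 : 1;
        if (runLen > bestCount) {
            bestCount = runLen;
            best = ngrams[i];
        }
    }
    if (best)
        printf("%s\n", best);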
    
Just for fun, I wrote an SQL version (SQL Server 2012):

It produces the required output:

    ------------------------------
    aaa
    3
    

If you are not bound to C: I wrote this Python script in about 10 minutes, and it processes a 1.5 MB file with more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on screen).

The text used for the test is James Joyce's Ulysses; you can find it for free here.

Word separators here are both the space and the newline \n:

    import sys
    
    text = open(sys.argv[1], 'r').read()
    ngram_len = int(sys.argv[2])
    text = text.replace('\n', ' ')
    words = [word.lower() for word in text.split(' ')]
    ngrams = {}
    for word in words:
        word_len = len(word)
        if word_len < ngram_len:
            continue
        for i in range(0, (word_len - ngram_len) + 1):
            ngram = word[i:i+ngram_len]
            if ngram in ngrams:
                ngrams[ngram] += 1
            else:
                ngrams[ngram] = 1
    ngrams_by_freq = {}
    for key, val in ngrams.items():
        if val not in ngrams_by_freq:
            ngrams_by_freq[val] = [key]
        else:
            ngrams_by_freq[val].append(key)
    ngrams_by_freq = sorted(ngrams_by_freq.items())
    for key in ngrams_by_freq:
        print('{} with frequency of {}'.format(key[1], key[0]))
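
For reference, the script takes the input file and the n-gram length as command-line arguments, so a run would look something like python3 ngrams.py ulysses.txt 3 (both file names here are made up).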
    
You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You are correct that a hash table is a good solution to the problem.

However, since your time to write a solution is limited, I'd suggest using open addressing instead of separate chaining with linked lists. The implementation may be simpler: if you reach a collision, you just walk farther along the array.

Also, make sure you allocate enough memory for your hash table: something around twice the expected number of n-grams should be fine.
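
A minimal sketch of such a table, sized to roughly twice the worst-case number of n-grams as suggested; the fixed NGRAM_LEN and all other names are illustrative:

    #include <string.h>

    #define NGRAM_LEN 3          /* hypothetical fixed length for the sketch */
    #define TABLE_SIZE (1 << 19) /* power of two, ~2x the worst-case 250,000 n-grams */

    typedef struct {
        char key[NGRAM_LEN + 1]; /* empty string marks an unused slot */
        int count;
    } Slot;

    static Slot table[TABLE_SIZE];

    static unsigned long hash_ngram(const char *s)
    {
        unsigned long h = 5381; /* djb2 */
        for (int i = 0; i < NGRAM_LEN; i++)
            h = h * 33 + (unsigned char)s[i];
        return h;
    }

    /* open addressing: on a collision, just walk to the next slot */
    static Slot *lookup(const char *ngram)
    {
        unsigned long i = hash_ngram(ngram) & (TABLE_SIZE - 1);
        while (table[i].key[0] != '\0' &&
               strncmp(table[i].key, ngram, NGRAM_LEN) != 0)
            i = (i + 1) & (TABLE_SIZE - 1);
        return &table[i];
    }

    static void add_ngram(const char *ngram)
    {
        Slot *s = lookup(ngram);
        if (s->key[0] == '\0') { /* first sighting: claim the slot */
            memcpy(s->key, ngram, NGRAM_LEN);
            s->key[NGRAM_LEN] = '\0';
        }
        s->count++;
    }

Counting is then one add_ngram call per in-word position; a final scan over the slots picks the maximum count, breaking ties with strcmp to get the lexicographically smallest n-gram.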

    your_str = "aaaab a0a baaab c"
    str_list = your_str.split(" ")
    str_hash = {}
    ngram_len = 3
    
    for str in str_list:
        start = 0
        end = ngram_len
        len_word = len(str)
        for i in range(0,len_word):
            if end <= len_word :
                if str_hash.get(str[start:end]):              
                    str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
                else:
                    str_hash[str[start:end]] = 1
                start = start +1
                end = end +1
            else:
                break
    
    keys_sorted =sorted(str_hash.items())
    for ngram in sorted(keys_sorted,key= lambda x : x[1],reverse = True):
        print "\"%s\" with a frequency of %s" % (ngram[0],ngram[1])
    
Comments:

• Is this how the actual coding interview ended? Are you sure a binary tree (say, an AVL tree) cannot do the job?
• Do you only need up to 3-grams? There are (26+26+10)^3 = 238,328 possible 3-character strings made of alphanumeric characters only, so a straight-up LUT looks feasible (see the sketch after this list).
• I would preallocate the needed number of buckets in a single array up front (this is possible since you have an upper bound on the text length) and only store pointers to them in the hash table. A move-to-front/insert-at-back heuristic can speed up hash table retrieval, and sort the array at the end. Using trees is slower in practice. Think about it: how many 3-grams are there in a 1000-character text?
• That's case-insensitive a-Z, 0-9, space, dollar, dot and undefined. Enough to compute trigrams of a text string.
• But you don't get the ngramLength parameter up front this way; it depends heavily on the fact that n = 3. Clever solution for trigrams, though, starting from radix50, for each c in counters…
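
A minimal sketch of the LUT idea from these comments, hard-coding n = 3 and mapping characters in ASCII order so that a scan of the table in index order resolves ties lexicographically (all names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* 62 alphanumerics -> 62^3 = 238,328 possible trigrams: small enough
       for one counter per trigram */
    static int counts[62 * 62 * 62];

    /* map chars so index order matches ASCII order:
       '0'-'9' -> 0..9, 'A'-'Z' -> 10..35, 'a'-'z' -> 36..61 */
    static int code(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return 10 + (c - 'A');
        return 36 + (c - 'a');
    }

    static char decode(int v)
    {
        if (v < 10) return (char)('0' + v);
        if (v < 36) return (char)('A' + v - 10);
        return (char)('a' + v - 36);
    }

    int main(void)
    {
        char buf[] = "aaaab a0a baaab c";

        /* count every in-word trigram with a direct table index */
        for (char *w = strtok(buf, " "); w != NULL; w = strtok(NULL, " ")) {
            size_t len = strlen(w);
            for (size_t i = 0; i + 3 <= len; i++)
                counts[(code(w[i]) * 62 + code(w[i + 1])) * 62 + code(w[i + 2])]++;
        }

        /* scanning indices upward visits trigrams in lexicographic order,
           so '>' keeps the smallest one among ties */
        int best = 0;
        for (int i = 1; i < 62 * 62 * 62; i++)
            if (counts[i] > counts[best])
                best = i;

        printf("%c%c%c\n", decode(best / (62 * 62)),
               decode((best / 62) % 62), decode(best % 62));
        return 0;
    }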
    
    
    your_str = "aaaab a0a baaab c"
    str_list = your_str.split(" ")
    str_hash = {}
    ngram_len = 3
    
    for str in str_list:
        start = 0
        end = ngram_len
        len_word = len(str)
        for i in range(0,len_word):
            if end <= len_word :
                if str_hash.get(str[start:end]):              
                    str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
                else:
                    str_hash[str[start:end]] = 1
                start = start +1
                end = end +1
            else:
                break
    
    keys_sorted =sorted(str_hash.items())
    for ngram in sorted(keys_sorted,key= lambda x : x[1],reverse = True):
        print "\"%s\" with a frequency of %s" % (ngram[0],ngram[1])