C: finding the n-gram that is the most frequent one among all the words
I ran into the following programming interview question:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. The word "pilot" has three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example

Input
text: "aaaab a0a baaab c"
ngramLength: 3

Output
aaa

Explanation

For the input above, the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I only had one hour to solve this problem and chose to use C: is implementing a hash table to count the frequency of the N-grams a good idea, given the amount of time available? There is no hash table implementation in the C standard library.

If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. These implementations reduce the time you have left for solving the problem itself.

Is this the fastest option? Thank you.

If implementation efficiency is important and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-gram they are part of, and then loop over that sorted array and figure out the counts.
This should perform fast enough, and there is no need to write any fancy data structures. Alternatively, you can convert each trigram to a RADIX-50 code. In RADIX-50, the packed value of a trigram fits in a 16-bit unsigned int, so afterwards you can use the radix-encoded trigram as an index into an array. Your code would then look like this:
uint32_t counters[1 << 16]; // 64K counters; 32-bit, since one trigram can occur more than 65535 times in 250,000 characters
bzero(counters, sizeof(counters));
for (const char *p = txt; p[0] && p[1] && p[2]; p++)
    counters[radix50(p)]++;
Sorry for posting Python, but this is what I would do. You might get some ideas for the algorithm from it. Note that this program solves an order of magnitude more words:
from itertools import groupby

someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
someText *= 40000
print len(someText)

n = 3
ngrams = []
for word in filter(lambda x: len(x) >= n, someText.split(" ")):
    for i in range(len(word)-n+1):
        ngrams.append(word[i:i+n])
        # you could inline all logic here
        # add to an ordered list for which the frequency is the key for ordering and the payload the actual word

ngrams_freq = list([[len(list(group)), key] for key, group in groupby(sorted(ngrams, key=str.lower))])
ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

popular_ngrams = []
for freq in ngrams_freq_sorted:
    if freq[0] == ngrams_freq_sorted[0][0]:
        popular_ngrams.append(freq[1])
    else:
        break

print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

# > 2560000
# > Most popular ngram: aaa
# > [Finished in 1.3s]
So the basic approach for solving this problem is:

find all the n-grams in the string
map all duplicates into a new structure that has the n-gram and the number of times it occurs
My C++ solution can be found here:

Given:
const unsigned int MAX_STR_LEN = 250000;
const unsigned short NGRAM = 3;
const unsigned int NGRAMS = MAX_STR_LEN-NGRAM;
//we will need a maximum of "the length of our string" - "the length of our n-gram"
//places to store our n-grams, and each ngram is specified by NGRAM+1 for '\0'
char ngrams[NGRAMS][NGRAM+1] = { 0 };
Then, for step 1, this is the code:
const char *ptr = str;
int idx = 0;
//notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
while (notTerminated(ptr)) {
    //noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
    if (noSpace(ptr)) {
        //safely copy our current n-gram over to the ngrams array
        //we're iterating over ptr and because we're here we know ptr and the next NGRAM spaces
        //are valid letters
        for (int i=0; i<NGRAM; i++) {
            ngrams[idx][i] = ptr[i];
        }
        ngrams[idx][NGRAM] = '\0'; //important to zero-terminate
        idx++;
    }
    ptr++;
}
Just for fun, I wrote a SQL version (SQL Server 2012):
Required output:
------------------------------
aaa
3
If you don't have to use C: I wrote this Python script in about 10 minutes. It processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (apart from printing the values on the screen).

The text used for the test is Ulysses by James Joyce; you can find it free online. Word separators here are both space and the newline \n.
import sys

text = open(sys.argv[1], 'r').read()
ngram_len = int(sys.argv[2])
text = text.replace('\n', ' ')
words = [word.lower() for word in text.split(' ')]

ngrams = {}

for word in words:
    word_len = len(word)
    if word_len < ngram_len:
        continue
    for i in range(0, (word_len - ngram_len) + 1):
        ngram = word[i:i+ngram_len]
        if ngram in ngrams:
            ngrams[ngram] += 1
        else:
            ngrams[ngram] = 1

ngrams_by_freq = {}

for key, val in ngrams.items():
    if val not in ngrams_by_freq:
        ngrams_by_freq[val] = [key]
    else:
        ngrams_by_freq[val].append(key)

ngrams_by_freq = sorted(ngrams_by_freq.items())

for key in ngrams_by_freq:
    print('{} with frequency of {}'.format(key[1:], key[0]))
You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You're right in thinking that a hash table is a good solution to the problem.

However, since you have limited time to write a solution, I'd suggest using open addressing instead of linked lists. The implementation may be simpler: if you reach a collision, you just walk farther along the table.

Also, make sure you allocate enough memory for your hash table: something about twice the expected number of n-grams should be fine.
your_str = "aaaab a0a baaab c"
str_list = your_str.split(" ")
str_hash = {}
ngram_len = 3

for str in str_list:
    start = 0
    end = ngram_len
    len_word = len(str)
    for i in range(0, len_word):
        if end <= len_word:
            if str_hash.get(str[start:end]):
                str_hash[str[start:end]] = str_hash.get(str[start:end]) + 1
            else:
                str_hash[str[start:end]] = 1
            start = start + 1
            end = end + 1
        else:
            break

keys_sorted = sorted(str_hash.items())
for ngram in sorted(keys_sorted, key=lambda x: x[1], reverse=True):
    print "\"%s\" with a frequency of %s" % (ngram[0], ngram[1])
Is this where the actual coding interview ended? Are you sure a binary tree (e.g. an AVL tree) couldn't do the job?

Do you only need 3-grams at most? There are (26+26+10)^3 = 238328 possible 3-grams over alphanumeric characters only, so a straight-up LUT looks viable.

I would preallocate the required number of buckets upfront in a single array (which is possible because you have an upper bound on the text length) and store only pointers to them in the hash table. Retrieval from the hash table can be sped up with a move-to-front / insert-at-back heuristic, and the array can be sorted at the end. Using a tree is slower in practice. Think about it: how many 3-grams are there in a text of 1000 characters?

RADIX-50 is case-insensitive and covers a-z, 0-9, space, dollar, dot and an undefined code: enough to count the trigrams of the text string. But then you no longer honor the ngramLength parameter; this depends heavily on the fact that n = 3. A clever solution for trigrams, though, starting from radix50, setting for each c in the counters…