Python NLTK: Showing the frequency of common phrases (ngrams) from a text field in a DataFrame using BigramCollocationFinder
I have the following example of a tokenized DataFrame:
No category problem_definition_stopwords
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
I successfully ran the code below to get the ngram phrases:
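import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(df['problem_definition_stopwords'])
# keep only bigrams that appear at least once
finder.apply_freq_filter(1)
# return the 10 bigrams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)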
The result looks like this, the top 10 bigrams by PMI:
[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
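I would like the results above to appear in a DataFrame with frequency counts, showing how often each of these bigrams occurs.
Example of the desired output: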
ngram frequency
'brewing', 'properly' 1
'galley', 'work' 1
'maker', 'brewing' 1
'properly', '2' 1
... ...
How can I do this in Python?

This should work.
First, set up your dataset (or a similar one):
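The setup code did not survive in this copy of the answer. The following is a minimal sketch of what it could look like, assuming the token lists from the question are held in a pandas Series named `s` (the name used in the later steps) and that `result` holds the output of `finder.nbest`; it is not necessarily the answerer's exact code.

```python
import pandas as pd
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Token lists copied from the example rows in the question
s = pd.Series([
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
])

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(s.tolist())
finder.apply_freq_filter(1)  # keep bigrams that appear at least once

# `result` is the list of bigram tuples that is used for filtering later on
result = finder.nbest(bigram_measures.pmi, 10)
```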
Use nltk.ngrams to re-create the list of ngrams:
from nltk import ngrams
ngram_list = [pair for row in s for pair in ngrams(row, 2)]
Use collections.Counter to count how many times each ngram appears across the whole corpus:
from collections import Counter
counts = Counter(ngram_list).most_common()
Build a DataFrame that looks like what you want:
pd.DataFrame.from_records(counts, columns=['gram', 'count'])
gram count
0 (420, 420) 2
1 (coffee, maker) 1
2 (maker, brewing) 1
3 (brewing, properly) 1
4 (properly, 2) 1
5 (2, 420) 1
6 (galley, work) 1
7 (work, table) 1
8 (table, stuck) 1
9 (cloth, stuck) 1
10 (stuck, coffee) 1
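To make the counting step easier to follow, here is a rough standard-library-only sketch of the same idea. The helper name `bigrams` and the variable `docs` are made up for illustration; `zip(tokens, tokens[1:])` stands in for `ngrams(row, 2)` and, like it, pairs tokens only within a single row.

```python
from collections import Counter

# Token lists copied from the example rows in the question
docs = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]

def bigrams(tokens):
    """Return adjacent token pairs from one document (hypothetical helper)."""
    return zip(tokens, tokens[1:])

# Count every bigram across all documents; pairs never span two rows
counts = Counter(pair for doc in docs for pair in bigrams(doc))

for gram, count in counts.most_common(3):
    print(gram, count)
```

Because the pairs are built row by row, the last token of one row is never paired with the first token of the next.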
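You can then filter to see only the ngrams produced by your finder.nbest call:
# result is the list returned by finder.nbest(bigram_measures.pmi, 10)
df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
df[df['gram'].isin(result)]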
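As a possible alternative to recounting, the finder itself keeps a frequency distribution of the bigrams it saw in its `ngram_fd` attribute, so the counts can be looked up there. This is a sketch under the assumption that the data is the question's example rows; the names `docs`, `top`, and `freq_df` are illustrative. How `from_documents` treats bigrams that span two documents has varied between NLTK versions, so counts for such pairs may differ from the per-row counts shown earlier.

```python
import pandas as pd
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Token lists copied from the example rows in the question
docs = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]

finder = BigramCollocationFinder.from_documents(docs)
top = finder.nbest(BigramAssocMeasures().pmi, 10)

# Look up each top-ranked bigram's raw count in the finder's own FreqDist
freq_df = pd.DataFrame.from_records(
    [(gram, finder.ngram_fd[gram]) for gram in top],
    columns=['gram', 'count'],
)
print(freq_df)
```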
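Sorry, I forgot to mention: what if I want the results grouped by the "category" field?
Which results do you mean? It may be worth asking a separate question, or looking into how the pandas.DataFrame.groupby method works.
This is the answer that worked for me after I tried many other approaches, so it is compatible with the current pandas/nltk packages. Thanks @blacksite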
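On the follow-up about grouping by category, one possible reading is "bigram counts computed separately for each category". The sketch below assumes that reading and rebuilds the question's example frame; the helper `bigram_counts` and the name `by_category` are made up for illustration, and this was not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Example rows copied from the question
df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition_stopwords': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

def bigram_counts(token_lists):
    """Count adjacent token pairs across several token lists (hypothetical helper)."""
    return Counter(
        pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
    )

# One (category, gram, count) record per distinct bigram within each category
records = [
    (category, gram, count)
    for category, group in df.groupby('category')
    for gram, count in bigram_counts(group['problem_definition_stopwords']).most_common()
]
by_category = pd.DataFrame.from_records(
    records, columns=['category', 'gram', 'count']
)
print(by_category)
```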