Python NLTK: Using BigramCollocationFinder to show the frequency of common phrases (ngrams) from a text field in a dataframe


I have the following sample of a tokenized dataframe:

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I successfully ran the code below to get the ngram phrases:

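from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(df['problem_definition_stopwords'])

# keep only bigrams that appear at least once
finder.apply_freq_filter(1)

# return the 10 bigrams with the highest PMI
result = finder.nbest(bigram_measures.pmi, 10)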

The result looks like this, the top 10 bigrams by PMI:

[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]

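I would like the result above in a dataframe that includes a frequency count, showing how often each of these bigrams occurs.

Example of the desired output: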

ngram                    frequency
'brewing', 'properly'    1
'galley', 'work'         1
'maker', 'brewing'       1
'properly', '2'          1
...                      ...

How can I do this in Python?

This should work.

First, set up the dataset (or something similar):
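The later snippets refer to a variable `s` together with `ngrams`, `Counter`, and `pd`. A minimal sketch of that setup, assuming `s` is a pandas Series holding the token lists from the question (the name `s` comes from the answer; its exact construction is an assumption):

```python
from collections import Counter

import pandas as pd
from nltk import ngrams

# Token lists copied from the question's sample rows.
# Assumption: the answer's `s` is a Series of these lists.
s = pd.Series([
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
])
```

With the real data, `s = df['problem_definition_stopwords']` would presumably serve the same purpose.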

Use nltk.ngrams to recreate the list of ngrams:

ngram_list = [pair for row in s for pair in ngrams(row, 2)]

Use collections.Counter to count how many times each ngram appears across the whole corpus:

counts = Counter(ngram_list).most_common()

Build a DataFrame that looks like what you want:

pd.DataFrame.from_records(counts, columns=['gram', 'count'])
                   gram  count
0            (420, 420)      2
1       (coffee, maker)      1
2      (maker, brewing)      1
3   (brewing, properly)      1
4         (properly, 2)      1
5              (2, 420)      1
6        (galley, work)      1
7         (work, table)      1
8        (table, stuck)      1
9        (cloth, stuck)      1
10      (stuck, coffee)      1

You can then filter it to see only the ngrams produced by your finder.nbest call:

# result is the list returned by finder.nbest(bigram_measures.pmi, 10)
df = pd.DataFrame.from_records(counts, columns=['gram', 'count'])
df[df['gram'].isin(result)]
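As an alternative sketch (not part of the original answer), the counts can also be read from the finder itself: `BigramCollocationFinder` exposes its bigram frequency distribution as `ngram_fd`, so each bigram returned by `nbest` can be looked up directly, which keeps the PMI ordering. The names `docs` and `freq_df` are illustrative, and the data is the sample from the question.

```python
import pandas as pd
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Token lists copied from the question's sample rows.
docs = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]

finder = BigramCollocationFinder.from_documents(docs)
result = finder.nbest(BigramAssocMeasures().pmi, 10)

# finder.ngram_fd maps each bigram to its raw count in the corpus;
# look up every top-PMI bigram and keep the order nbest returned.
freq_df = pd.DataFrame.from_records(
    [(gram, finder.ngram_fd[gram]) for gram in result],
    columns=['ngram', 'frequency'],
)
print(freq_df)
```

On this sample the counts should agree with the Counter approach, since neither method forms bigrams across row boundaries.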

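Sorry, I forgot to mention: what if I want the results grouped by the "category" field?

Which results do you mean? It may be worth asking a separate question, or looking into how the pandas.DataFrame.groupby method works.

This is the answer that worked for me after I tried many other approaches, so it is compatible with the current pandas/nltk packages. Thanks @blacksite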
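For the per-category follow-up, one possible sketch is below. It swaps `groupby` for a `Counter` keyed by `(category, gram)`, which sidesteps grouping on tuple-valued columns. The column names come from the question; `by_category` is an illustrative name, and this is only one way to read "grouped by category".

```python
from collections import Counter

import pandas as pd
from nltk import ngrams

# Sample rows copied from the question.
df = pd.DataFrame({
    'category': [2521, 1438, 2698, 2521],
    'problem_definition_stopwords': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# Count every bigram separately within each category.
counts = Counter(
    (category, gram)
    for category, tokens in zip(df['category'].tolist(),
                                df['problem_definition_stopwords'])
    for gram in ngrams(tokens, 2)
)

# One row per (category, bigram) pair, most frequent first within a category.
by_category = pd.DataFrame.from_records(
    [(category, gram, n) for (category, gram), n in counts.items()],
    columns=['category', 'gram', 'count'],
).sort_values(['category', 'count'], ascending=[True, False])
print(by_category)
```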
