Python 3.x: pandas column (Python 3.x, Pandas, NLP, NLTK, Trigram)

python-3.x pandas nlp

I have a pandas DataFrame with the following columns:

Column 1

['if', 'you', 'think', 'she', "'s", 'cute', 'now', ',', 'you', 'should', 'have', 'see', 'her', 'a', 'couple', 'of', 'year', 'ago', '.']
['uh', ',', 'yeah', '.', 'just', 'a', 'fax', '.']

Column 2

if you think she 's cute now , you should have see her a couple of year ago .
uh , yeah . just a fax .

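and so on.

My goal is to compute the bigrams, trigrams and quadgrams of the DataFrame (specifically of column 2, which has already been lemmatized).

I tried the following: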

import nltk
from nltk import bigrams
from nltk import trigrams

trig = trigrams(df ["Column2"])
print (trig)

However, this prints the following instead of the trigrams:

<generator object trigrams at 0x0000013C757F1C48>

My end goal is to be able to print the top X bigrams, trigrams, etc.
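
The printed line is not an error: `trigrams` returns a lazy generator, and printing a generator shows only its object representation. A second issue is that iterating over a DataFrame column yields whole rows, so the result would be triples of sentences rather than triples of words. The following is a minimal sketch of both points. It assumes a plain list in place of `df["Column2"]` and uses a hypothetical stand-in generator, `sliding_triples` (not NLTK's implementation), so that it runs without extra packages:

```python
def sliding_triples(items):
    """Hypothetical stand-in for nltk.trigrams: lazily yield consecutive triples."""
    items = list(items)
    for i in range(len(items) - 2):
        yield tuple(items[i:i + 3])

# Plain list standing in for df["Column2"]; each element is one row (a sentence).
rows = ["uh , yeah .", "just a fax .", "if you think"]

gen = sliding_triples(rows)
print(gen)  # shows a generator object; nothing has been computed yet

# Consuming the generator produces the tuples: here, one triple of whole rows.
row_triples = list(gen)
print(row_triples)

# Splitting a single row into words first gives word-level trigrams.
word_triples = list(sliding_triples(rows[0].split()))
print(word_triples)
```

So the fix has two parts: split each row into words before building trigrams, and consume the generator (for example with `list`) before printing.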

Use a list comprehension with split to flatten all the trigrams first:

import pandas as pd
from nltk import trigrams

df = pd.DataFrame({'Column2': ["if you think she cute now you if uh yeah just",
                               'you think she uh yeah just a fax']})

# Split each row into words, build its trigrams, and flatten them into one list
L = [t for s in df['Column2'] for t in trigrams(s.split())]
print(L)
[('if', 'you', 'think'), ('you', 'think', 'she'), ('think', 'she', 'cute'), 
 ('she', 'cute', 'now'), ('cute', 'now', 'you'), ('now', 'you', 'if'), 
 ('you', 'if', 'uh'), ('if', 'uh', 'yeah'), ('uh', 'yeah', 'just'), 
 ('you', 'think', 'she'), ('think', 'she', 'uh'), ('she', 'uh', 'yeah'),
 ('uh', 'yeah', 'just'), ('yeah', 'just', 'a'), ('just', 'a', 'fax')]
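
The same flattening extends to bigrams and four-word n-grams once the window size becomes a parameter. The sketch below uses a hypothetical helper, `ngrams_sketch`, built on `zip`, with a plain list standing in for `df['Column2']`. NLTK's own `nltk.util.ngrams(sequence, n)` should be usable in its place.

```python
def ngrams_sketch(tokens, n):
    """Hypothetical helper: return all runs of n consecutive tokens as tuples."""
    tokens = list(tokens)
    # zip over n shifted copies of the token list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

# Plain list standing in for df['Column2'].
sentences = ["uh yeah just a fax", "you think she"]

# One flat list of n-grams per window size, built row by row.
flat = {n: [g for s in sentences for g in ngrams_sketch(s.split(), n)]
        for n in (2, 3, 4)}
print(flat[2])
print(flat[4])
```

Rows shorter than the window size simply contribute no n-grams, so short sentences need no special handling.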

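What is the expected output? A list of the top X trigrams (for example, the top 10): trigram 1: 450, trigram 2: 345, and so on.

Then count the tuples with Counter, and use its most_common method for the top values: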

from collections import Counter
c = Counter(L)
print (c)
Counter({('you', 'think', 'she'): 2, ('uh', 'yeah', 'just'): 2, ('if', 'you', 'think'): 1,
         ('think', 'she', 'cute'): 1, ('she', 'cute', 'now'): 1, ('cute', 'now', 'you'): 1,
         ('now', 'you', 'if'): 1, ('you', 'if', 'uh'): 1, ('if', 'uh', 'yeah'): 1, 
         ('think', 'she', 'uh'): 1, ('she', 'uh', 'yeah'): 1, 
         ('yeah', 'just', 'a'): 1, ('just', 'a', 'fax'): 1})

top = c.most_common(5)
print (top)
[(('you', 'think', 'she'), 2), (('uh', 'yeah', 'just'), 2), 
 (('if', 'you', 'think'), 1), (('think', 'she', 'cute'), 1),
 (('she', 'cute', 'now'), 1)]
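
To get the top X for bigrams, trigrams and four-word n-grams in one go, the counting can be wrapped in a small function. This is a sketch under assumptions: `top_ngrams` is a hypothetical name, and the input is a plain list of token lists (the shape of column 1, or of column 2 after `split`). It feeds each row's n-grams straight into the `Counter`, so no intermediate flat list is built.

```python
from collections import Counter

def top_ngrams(token_lists, n, k):
    """Hypothetical helper: the k most frequent n-grams across all rows."""
    counts = Counter()
    for tokens in token_lists:
        # Count each row separately so n-grams never span two rows.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Token lists standing in for column 1 (or for column 2 after .split()).
rows = [
    "if you think she cute now you if uh yeah just".split(),
    "you think she uh yeah just a fax".split(),
]

for n in (2, 3, 4):
    print(n, top_ngrams(rows, n, 2))
```

With a real DataFrame, something like `top_ngrams(df['Column2'].str.split(), 3, 10)` should give the top 10 trigrams, assuming every row is a string.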