Python: getting the frequency of each n-gram term with sklearn

I use the following function to extract n-grams from a pandas DataFrame:

from sklearn.feature_extraction.text import CountVectorizer

def extractNGrams(df, ngram_size, min_freq):
    """Extract n-grams from the 'Text' column of a DataFrame.
    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum number of sentences an ngram must appear in
                (passed to min_df) to be part of the set
    """
    vect = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    # scikit-learn >= 1.0; older releases use vect.get_feature_names()
    vocab = vect.get_feature_names_out()
    #print (vocab)
    print(X_train_counts.shape)
    return vocab

How can I get the frequency of each n-gram term?

Posting the code I used to get the counts:

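import numpy as np

# X_train_counts and vect are the objects created inside extractNGrams
train_data_features = X_train_counts.toarray()
# scikit-learn >= 1.0; older releases use vect.get_feature_names()
vocab = vect.get_feature_names_out()
# Sum each column to get the total count of every n-gram
dist = np.sum(train_data_features, axis=0)
ngram_freq = {}

# For each n-gram, store its total frequency
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count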

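The vocab variable you define holds a mapping between terms and feature indices, for example {"word1": 0, "word2": 1}. The frequencies you need are given by the non-zero entries of X_train_counts. That is, if the first column has the value 2, then "word1" occurs twice. Does this help?

@geompalik Got it! That helps, thanks.

Don't use .toarray(), because that converts the sparse matrix into a dense one. Just keep it as it is.

So should the first line be train_data_features = X_train_counts?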
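
As a rough illustration of the suggestion to skip .toarray(), the sketch below sums the columns of the sparse matrix directly. The function name ngram_frequencies and the sample sentences are made up for this example. It assumes the fitted vectorizer's vocabulary_ attribute as the term-to-column mapping, which may or may not be what the first comment meant by vocab.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def ngram_frequencies(sentences, ngram_size=2, min_freq=1):
    """Return {ngram: total count} without densifying the count matrix.

    Hypothetical helper for illustration; not part of the original post.
    """
    vect = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    counts = vect.fit_transform(sentences)  # sparse, shape (n_sentences, n_ngrams)

    # Summing a sparse matrix over axis 0 gives a 1 x n_ngrams result;
    # np.asarray(...).ravel() flattens it into a plain 1-D array.
    totals = np.asarray(counts.sum(axis=0)).ravel()

    # vocabulary_ maps each n-gram to its column index in `counts`.
    return {term: int(totals[idx]) for term, idx in vect.vocabulary_.items()}


# Made-up sample data
sentences = ["the cat sat", "the cat ran", "the dog sat"]
freqs = ngram_frequencies(sentences)
print(freqs)
```

Note that min_df filters by the number of sentences containing an n-gram, not by its total count, so an n-gram repeated many times inside a single sentence can still be dropped when min_freq is greater than 1.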