Python: getting the frequency of each n-gram term with sklearn

I use the following function to extract n-grams from a pandas DataFrame:

import sklearn.feature_extraction.text

def extractNGrams(df, ngram_size, min_freq):
    """Extract NGrams from a list of Strings
    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum number of sentences an ngram must appear in (passed to min_df)
    """
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0
    vocab = vect.get_feature_names_out()
    #print (vocab)
    print(X_train_counts.shape)
    return vocab

How can I get the frequency of each n-gram term?

Posting the code I used to get the counts:

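import numpy as np

# X_train_counts and vect are the objects created inside extractNGrams
train_data_features = X_train_counts.toarray()
vocab = vect.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
dist = np.sum(train_data_features, axis=0)
ngram_freq = {}

# For each n-gram, store the term and its total count
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count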

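In the vocab variable you define, there is a mapping between the terms and the feature indices, e.g. {"word1": 0, "word2": 1}. The frequencies you need are given by the non-zero entries of the variable X_train_counts. That is, if the first column has the value 2, then "word1" occurs twice. Does this help?

@geompalik Got it! That helps. Thanks.

Don't use .toarray(), because it converts the sparse matrix into a dense one.

Just leave it as it is, i.e. should the first line be train_data_features = X_train_counts?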