Python: getting the frequency of each n-gram term with sklearn
I use the following method to extract n-grams from a pandas DataFrame:
import sklearn.feature_extraction.text

def extractNGrams(df, ngram_size, min_freq):
    """Extract NGrams from a list of Strings

    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum frequency for the ngram to be part of the set
                (passed as min_df, i.e. the number of sentences containing it)
    """
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    vocab = vect.get_feature_names_out()
    #print (vocab)
    print(X_train_counts.shape)
    return vocab
How can I get the frequency of each n-gram term?

Posting the code I used to get the counts:
import numpy as np

train_data_features = X_train_counts.toarray()
vocab = vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
dist = np.sum(train_data_features, axis=0)
ngram_freq = {}
# For each, print the vocabulary word and the frequency
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count
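In the vocab variable you define, there is a mapping between the terms and the feature indices, for example {"word1": 0, "word2": 1}. The frequencies you need are given by the non-zero entries of the variable X_train_counts. That is, if the first column has the value 2, then "word1" occurs twice. Does this help?

@geompalik Got it! That helps. Thanks!

Don't use .toarray(), because it converts the sparse matrix to a dense one. Just leave it sparse.

So should the first line be train_data_features = X_train_counts?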
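
To make the two suggestions in the comments concrete, here is a minimal sketch of counting n-grams without calling .toarray(). It assumes a recent scikit-learn; the toy sentences are hypothetical stand-ins for df['Text'].values.tolist(), and it reads the term-to-column mapping from the vectorizer's vocabulary_ attribute rather than from the list returned by get_feature_names().

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences standing in for df['Text'].values.tolist()
sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "dog sat on the mat",
]

vect = CountVectorizer(ngram_range=(2, 2), min_df=1)
X_train_counts = vect.fit_transform(sentences)  # stays a scipy sparse matrix

# Summing over axis 0 gives one total per column (per n-gram).
# np.asarray(...).ravel() flattens the result into a plain 1-D array.
totals = np.asarray(X_train_counts.sum(axis=0)).ravel()

# vocabulary_ maps each n-gram to its column index in X_train_counts.
ngram_freq = {term: int(totals[idx]) for term, idx in vect.vocabulary_.items()}

# Print the n-grams from most to least frequent.
for term, count in sorted(ngram_freq.items(), key=lambda kv: -kv[1]):
    print(term, count)
```

Only the column sums are ever materialized, so memory use scales with the vocabulary size rather than with sentences times vocabulary, which is what the dense .toarray() version would need.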