Python 3.x: unique word tags from K-Means
I want to get a list of unique tags from K-Means clustering. I have the following code:
import pandas as pd
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# result (a DataFrame) and model_setting are defined elsewhere in my script
def cluster_tagging(variable_a_taggear):
    document = result[variable_a_taggear]
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))
    X = vectorizer.fit_transform(document)
    true_k = 180
    puntos2 = true_k
    if model_setting == 'MiniBatchKMeans':
        # model = MiniBatchKMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
        pass
    elif model_setting == 'KMeans':
        model = KMeans(n_clusters=true_k, init='k-means++', max_iter=10000000, n_init=1)
    model.fit(X)
    # Term indices for each centroid, highest TF-IDF weight first
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    cluster_ = []
    key_ = []
    cluster_col = 'Cluster_%s' % variable_a_taggear
    keywords_col = 'Keywords_%s' % variable_a_taggear
    for i in range(puntos2):
        print('Cluster %s:' % i)
        cluster_.append(i)
        key_1 = []
        key_.append(key_1)
        # Collect the top 8 terms of this cluster's centroid
        for ind in order_centroids[i, :8]:
            print('%s' % terms[ind])
            key_1.append(terms[ind])
    print('first key_', key_)
    info = {cluster_col: cluster_, keywords_col: key_}
    word_cloud = pd.DataFrame(info)
    word_cloud.head()
    # Assign each document to a cluster and attach its keywords
    predicted = model.predict(vectorizer.transform(document))
    lst2 = result['Ticket ID']
    predictions = pd.DataFrame(list(zip(predicted, lst2)), columns=[cluster_col, 'Ticket ID'])
    resultado = pd.merge(predictions, word_cloud, on=cluster_col, how='inner')
    print(resultado.head())
    return resultado
As you would expect from the n-gram range, I get repeated words that appear as part of different n-grams. For example, for one cluster I have the following tags: ['fecha iniciar', 'iniciar', 'modificar fecha iniciar cc', 'proceder modificar fecha iniciar', 'proceder modificar fecha iniciar cc', 'fecha iniciar cc', 'fecha']
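To illustrate where the overlap comes from: with a word-level n-gram range, every word of a phrase appears inside many of its n-grams. This is a minimal pure-Python sketch of that effect (the `word_ngrams` helper is hypothetical, not part of my code, and mimics what the vectorizer does with `ngram_range`):

```python
# Hypothetical helper: generate word-level n-grams of sizes n_min..n_max
def word_ngrams(text, n_min, n_max):
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(' '.join(words[i:i + n]))
    return grams

phrase = "proceder modificar fecha iniciar"
for g in word_ngrams(phrase, 1, 3):
    print(g)
# 'fecha' and 'iniciar' each show up in several of the printed n-grams,
# which is why the cluster keywords contain repeated words.
```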
How can I get a list of unique words for each cluster?
Thanks.

Question: How can I get a list of unique words for each cluster?
You can use nltk's word_tokenize to split the words in each sentence, and numpy.unique to get the unique values of the resulting array:
import numpy as np
from nltk.tokenize import word_tokenize
cluster_tags = ['fecha iniciar', 'iniciar', ..., 'fecha']
one_string = ' '.join(cluster_tags)
np.unique(word_tokenize(one_string))
If you are sure the words are always separated by a single clean space, you can simply split them:
np.unique(' '.join(cluster_tags).split())
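One caveat: np.unique also sorts the result. If you want to keep the words in the order they appear in the centroid terms, a dict-based dedup from the standard library does the same job (a sketch, using a shortened version of the tags above):

```python
cluster_tags = ['fecha iniciar', 'iniciar', 'fecha iniciar cc', 'fecha']

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
unique_words = list(dict.fromkeys(' '.join(cluster_tags).split()))
print(unique_words)  # ['fecha', 'iniciar', 'cc']
```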
Bonus tip:
If you want, you can also count the frequency of each word:
# See answer by Max Malysh: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
from collections import Counter
from pandas.core.common import flatten
tokenized = [word_tokenize(text) for text in cluster_tags]
Counter(flatten(tokenized))
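Note that pandas.core.common.flatten is a private pandas API and may change between versions; itertools.chain.from_iterable from the standard library does the same flattening (shown here with a plain split so it runs without nltk):

```python
from collections import Counter
from itertools import chain

cluster_tags = ['fecha iniciar', 'iniciar', 'fecha']
tokenized = [text.split() for text in cluster_tags]

# chain.from_iterable flattens the list of token lists into one stream
counts = Counter(chain.from_iterable(tokenized))
# 'fecha' and 'iniciar' each appear twice in this small example
```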