K-Means clustering: output clusters contain the same number of elements, just in a different order [Python]

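I am running K-Means clustering on a list of single words. This is a cricket-based project, so I chose K=3 so that I can later label the three clusters as [batting, bowling, fielding]. However, after running the code, the three resulting clusters all contain the same elements, just in a different order. I tried making the initial list contain only distinct elements, but that did not solve the problem. The code is below: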

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

len(finaldatatext)
#2173
vectorizer = TfidfVectorizer(stop_words='english')
#finaldatatext here is the list containing distinct elements
X = vectorizer.fit_transform(finaldatatext)

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn < 1.0, use vectorizer.get_feature_names() instead
terms = vectorizer.get_feature_names_out()

clusterlists = []
for i in range(true_k):
    dummy_list = []
    for ind in order_centroids[i]:
        #print( '%s' % terms[ind])
        dummy_list.append('%s' % terms[ind])
    clusterlists.append(dummy_list)

A sample of the initial list:

['anymore', 'silly', 'fielders', 'fans', 'rcb', 'precedent', 'reputation', 'pool', 'International', 'famous', 'Astle', 'max', 'stadium', 'bennet', 'working', 'lassi', 'ameetasinh', 'meantime', 'com', 'on', 'little', 'saini', 'Kanos', 'telling', 'six', 'PrithviShaw', 'started', 'letting', 'wYB2P72Il2', 'chess', 'brainwashed', 'Stat', 'mediocre', 'Afridi', 'hopes', 'strength', 'jamieson', 'managed', '46th', 'finale', 'PaRtNeRShIP', 'Another', 'kind', 'exactly', 'Happybirthday', 'out', 'RidaNajamKhan', 'scoreline', 'Career', 'boiiiiiiiiiiiii', 'based', 'starting', 'Test', 'omnipresent', 'Hahaha', 'version', 'victory', 'desert', 'cowards', 'OUTDATED', 'nz', 'inspecting', 'honestly', 'wait', 'Unless', 'steadying', 'think', 'anyone', 'YER', 'rant', 'one', 'odis', 'BANTER', 'paav', 'Ug6cTFgG8U', 'aggressive', 'brought', 'workload', 'Wise', 'ca', 'Brilliant', 'twist', 'open', 'THROWS', 'bringing', 'till', 'starts', 'gives', 'wYB', 'fifty', 'SENA', 'baboon', 'punishment', 'summarized', 'feeling', 'pandya', 'Bangladesh', 'hurting', 'accent', 'Kid', 'well']
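The expected result is three different clusters with distinct values, which I can then label as batting, bowling and fielding based on their elements. At the moment I get three identical clusters in different orders.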

print(clusterlists[0])
#sample reduced result
['absence', 'zize6kysq2', 'flexibility', 'finally', 'finals', 'fined', 'finisher', 'firepower', 'fit', 'fitness', 'flaw', 'flaws', 'fleming', 'fluffed', 'frame', 'fluke', 'fn0uegxgss', 'focussed', 'foot', 'forget', 'forgot', 'form', 'format', 'forward', 'fought', 'fow', 'finale', 'final', 'filter', 'figures', 'fashioned', 'fast', 'fastest', 'fat', 'fatigue', 'fault', 'fav', 'featured', 'feel', 'feeling', 'feels', 'fees', 'feet', 'felt', 'ferguson', 'fewest', 'ffc4pfbvfr', 'ffs', 'field', 'fielder', 'fielders', 'fielding', 'fight', 'fow_hundreds', 'frankly', 'faridabad', 'given', 'giving', 'glad', 'glenn', 'gloves', 'god', 'gods', 'goes', 'going', 'gois', 'gon', 'gone', 'good', 'got', 'grand', 'grandhomme', 'grandmom', 'grandpa', 'grass', 'great', 'greatest', 'greatness', 'greig', 'grind', 'gives', 'gingers', 'free', 'gill', 'frontline','fulfilling', 'future', 'gaandu', 'gabbar', 'gajal_dalmia', 'gambhir', 'game', 'gangsta', 'geez', 'gem', 'genius', 'genuinely', 'gets', 'getter', 'getting', 'giant', 'giddy', 'fascinating', 'fared', 'groupby', 'drives', 'dropped', 'drowning', 'dube', 'dude', 'dumb', 'dumbass', 'duo', 'e3cli7hakf', 'e9fhdkxvvl', 'earlier', 'early', 'earned', 'easiest', 'easily', 'easy', 'economically', 'economy', 'edengarden', 'edge']
len(clusterlists[0])
#1728
len(clusterlists[1])
#1728
len(clusterlists[2])
#1728
All three currently contain the same values. Please suggest a solution. Thanks in advance.

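Your "clusterlists" is only appended to once, at the end of the code. Try correcting the indentation of the "clusterlists" line and it should work.

Also, the indentation in the code as originally posted looks off. Check the indentation after copying and pasting.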

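A while ago I tested some code for text clustering. Computing distances between texts is a bit unconventional, but if you really want to, you can do it like this: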

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

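print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn < 1.0, use vectorizer.get_feature_names() instead
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    # Only the 10 highest-weighted terms of each centroid
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()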

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Just modify it to suit your specific needs.
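
For the word list in the question, one possible adaptation is to group the input words by the label KMeans assigns to each of them (`model.labels_`) instead of listing terms from `cluster_centers_`. The sketch below is only illustrative: `words` is a made-up stand-in for `finaldatatext`, and `random_state=0` is added for repeatability; neither comes from the original post. With one-word documents each TF-IDF vector has at most one nonzero entry, so the resulting groups may well not correspond to batting, bowling and fielding.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for finaldatatext (not the asker's data)
words = ["fielders", "stadium", "victory", "strength", "finale",
         "scoreline", "workload", "punishment", "aggressive", "famous"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(words)

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100,
               n_init=1, random_state=0)
model.fit(X)

# labels_[j] is the cluster assigned to words[j], so every word
# ends up in exactly one group
clusters = defaultdict(list)
for word, label in zip(words, model.labels_):
    clusters[int(label)].append(word)

for label in sorted(clusters):
    print(label, clusters[label])
```

Because each word is assigned exactly one label, the groups cannot overlap, unlike lists built from the full `order_centroids[i]` rows.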

No, sorry, that was a mistake made when copying the code. The original code is indented correctly.
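
If indentation is not the cause, the identical lengths may have another explanation: `model.cluster_centers_` has one column per vocabulary term, and `argsort()` on a row returns the indices of all of its columns. Each list built from `order_centroids[i]` would then contain the whole vocabulary, only in a different order. That would fit the reported length of 1728 if that is the vocabulary size, which the post does not state. A small sketch with a made-up 3x5 array (`centers` and `terms` are hypothetical, not the asker's data):

```python
import numpy as np

# Hypothetical centroid matrix: 3 clusters x 5 vocabulary terms
centers = np.array([[0.9, 0.0, 0.1, 0.3, 0.0],
                    [0.0, 0.8, 0.2, 0.0, 0.1],
                    [0.2, 0.1, 0.0, 0.0, 0.7]])
terms = np.array(["bat", "bowl", "catch", "run", "wicket"])

# Same expression as in the question: column indices per row, highest weight first
order_centroids = centers.argsort()[:, ::-1]

# Every row is a permutation of all 5 indices, so each list has all 5 terms
full_lists = [terms[row].tolist() for row in order_centroids]
print([len(lst) for lst in full_lists])  # [5, 5, 5]

# Keeping only the first n_top entries of each row gives lists that can differ
n_top = 2
top_lists = [terms[row[:n_top]].tolist() for row in order_centroids]
print(top_lists)
```

In the question's loop this would correspond to iterating over `order_centroids[i, :n]` for some chosen `n`, as the second answer's code does with `:10`.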