Python 如何使用matplotlib绘制Kmeans文本聚类结果?

Python 如何使用matplotlib绘制Kmeans文本聚类结果?,python,matplotlib,machine-learning,scikit-learn,Python,Matplotlib,Machine Learning,Scikit Learn,我有以下代码将一些示例文本与scikit learn进行集群 train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"] vect = TfidfVectorizer() X = vect.fit_transform(train) clf =

我有以下代码将一些示例文本与scikit learn进行集群

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]

vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()
我搞不懂的是如何绘制聚类结果。X是csr_矩阵。我想要的是每个要绘制的结果的(x,y)坐标


Ty

你的tf idf矩阵最终是3 x 17,所以你需要做一些投影或降维来获得二维的质心。你有几个选择;以下是t-SNE的一个示例:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog",
     "blue sweater", "red hat", "kitty blue"]

vect = TfidfVectorizer()  
X = vect.fit_transform(train)
random_state = 1
clf = KMeans(n_clusters=3, random_state = random_state)
data = clf.fit(X)
centroids = clf.cluster_centers_

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
         early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_centroids = model.fit_transform(centroids)
print transformed_centroids
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
plt.show()

在你的例子中,如果你使用PCA初始化你的t-SNE,你会得到大间距的质心;如果你使用随机初始化,你会得到很小的质心和一张无趣的图片。

这里有一个更长更好的答案,有更多的数据:

import matplotlib.pyplot as plt
from numpy import concatenate
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = [
    'In 1917 a German Navy flight crashed at/near Off western Denmark with 18 aboard',
    # 'There were 18 passenger/crew fatalities',
    'In 1942 a Deutsche Lufthansa flight crashed at an unknown location with 4 aboard',
    # 'There were 4 passenger/crew fatalities',
    'In 1946 Trans Luxury Airlines flight 878 crashed at/near Moline, Illinois with 25 aboard',
    # 'There were 2 passenger/crew fatalities',
    'In 1947 a Slick Airways flight crashed at/near Hanksville, Utah with 3 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1949 a Royal Canadian Air Force flight crashed at/near Near Bigstone Lake, Manitoba with 21 aboard',
    'There were 21 passenger/crew fatalities',
    'In 1952 a Airwork flight crashed at/near Off Trapani, Italy with 57 aboard',
    'There were 7 passenger/crew fatalities',
    'In 1963 a Aeroflot flight crashed at/near Near Leningrad, Russia with 52 aboard',
    'In 1966 a Alaska Coastal Airlines flight crashed at/near Near Juneau, Alaska with 9 aboard',
    'There were 9 passenger/crew fatalities',
    'In 1986 a Air Taxi flight crashed at/near Frenchglen, Oregon with 6 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1989 a Air Taxi flight crashed at/near Gold Beach, Oregon with 3 aboard',
    'There were 18 passenger/crew fatalities',
    'In 1990 a Republic of China Air Force flight crashed at/near Yunlin, Taiwan with 18 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1992 a Servicios Aereos Santa Ana flight crashed at/near Colorado, Bolivia with 10 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1994 Royal Air Maroc flight 630 crashed at/near Near Agadir, Morocco with 44 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1995 Atlantic Southeast Airlines flight 529 crashed at/near Near Carrollton, GA with 29 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1998 a Lumbini Airways flight crashed at/near Near Ghorepani, Nepal with 18 aboard',
    'There were 18 passenger/crew fatalities',
    'In 2004 a Venezuelan Air Force flight crashed at/near Near Maracay, Venezuela with 25 aboard',
    'There were 25 passenger/crew fatalities',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train)
n_clusters = 2
random_state = 1
clf = KMeans(n_clusters=n_clusters, random_state=random_state)
data = clf.fit(X)
centroids = clf.cluster_centers_
# we want to transform the rows and the centroids
everything = concatenate((X.todense(), centroids))

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 10
model = TSNE(n_components=2, random_state=random_state, init=tsne_init,
    perplexity=tsne_perplexity,
    early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_everything = model.fit_transform(everything)
print(transformed_everything)
plt.scatter(transformed_everything[:-n_clusters, 0], transformed_everything[:-n_clusters, 1], marker='x')
plt.scatter(transformed_everything[-n_clusters:, 0], transformed_everything[-n_clusters:, 1], marker='o')

plt.show()
数据中有两个明显的聚类:一个是对车祸的描述,另一个是死亡人数的总结。注释掉行并稍微上下调整集群大小是很容易的。编写的代码应该显示两个蓝色的簇,一个大一个小,有两个橙色的质心。数据项比标记多:一些数据行被转换到空间中的相同点上


最后,较小的t-SNE学习率似乎会产生更紧密的聚类。

我想知道这是否也可以打印(在相同的分散点上)语料库中靠近每个质心的相应文档@迈克Delong@AndreaMoro是的,看起来这是可能的。t-SNE模型没有单独的拟合和变换方法;如果是这样的话,我们可以在原始数据X上进行拟合,转换原始数据和质心,然后用不同的标记将它们标绘在一起。因此,我们需要从k-means模型中提取质心,将它们与X连接在一起,然后通过t-SNE模型将整个过程传递到一起,然后分别绘制X行和质心行。看起来t-SNE模型保留了行顺序,所以这非常简单。谢谢。我不确定我是否理解连接部分。这不会混淆信息吗?也许你可以修改你的答案来涵盖这一点?@AndreaMoro请看下面。谢谢。最后一个。使用基于sklearn.decomposition的不同方法时出现的困惑。然而,我注意到降低维度也需要重新装配模型,而集群化后来有所不同。我现在怀疑这是否正确,考虑到这一减少——至少在我的情况下——只是为了策划。