Python: How do I find the indices of the data in each cluster after applying PCA and a clustering method?


I want to match each point in the plot to its corresponding index in the table.
Below are the first three rows of the data I am working with:

After applying PCA (2D) to my data, I get a plot similar to this one, and I then use the k-means algorithm for clustering.

Suppose I take all the data in the red cluster. How can I find the indices of those data points in the table?
Or, how can I find the index of each point in the table?
The goal is to be able to match each point in the plot to its corresponding row in the table. I am using Python.

Articles or documentation explaining this are welcome. Thanks in advance.

You can find the indices of the points in each cluster as follows.

Assume the data has 12 samples, each with three features. PCA is used to reduce the number of features, and then the data is clustered with k-means:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import decomposition

data=np.array([[5,4.5,6],[1,1,2],[1,2,1.5],[8,8.5,8],[2,2.5,1.5],[4,4.5,5],[8.5,8,9],[4,6,5.5],[5,6,5],[8,9,8.5],[9,9,8],[9,8,9]])

pca = decomposition.PCA(n_components=2) # apply pca
pca.fit(data)
data2 = pca.transform(data) # new data saved in data2
print("data with two features:\n", data2)
plt.plot(data2[:,0],data2[:,1],'ro')
for i in range(data2.shape[0]):        
        plt.text(data2[i,0],data2[i,1], str(i), fontsize=12)
plt.show()

#run Kmeans on data2 with 3 clusters
km=KMeans(n_clusters=3) # number of clusters =3 
km=km.fit(data2)
cluster_labels=km.labels_ # get cluster label of all data
print("cluster labels of points:", cluster_labels)

# get indexes of points in each cluster 
#Note: you can use these indexes in both data and data2
index_cluster_0=np.where(cluster_labels==0)[0] # get indexes of points in cluster 0 
print("indexes of points in cluster 0:", index_cluster_0)
index_cluster_1=np.where(cluster_labels==1)[0] # get indexes of points in cluster 1
print("indexes of points in cluster 1:", index_cluster_1)
index_cluster_2=np.where(cluster_labels==2)[0] # get indexes of points in cluster 2
print("indexes of points in cluster 2:", index_cluster_2)

#plot the results
plt.plot(data2[index_cluster_0,0],data2[index_cluster_0,1],'ro') #samples in cluster 0 are red
plt.plot(data2[index_cluster_1,0],data2[index_cluster_1,1],'bo') #samples in cluster 1 are blue
plt.plot(data2[index_cluster_2,0],data2[index_cluster_2,1],'go') #samples in cluster 2 are green
plt.title('Cluster 0: red, Cluster 1: blue, Cluster 2: green')
plt.show()
Output:

 data with two features:
 [[ 0.78528736  1.06750913]
 [ 7.42907267  0.7576433 ]
 [ 7.15249919 -0.32552917]
 [-4.40158551 -0.37613828]
 [ 6.26503561 -0.55127462]
 [ 1.95747537  0.30362524]
 [-4.9903965   0.69673761]
 [ 0.83759652 -0.57439496]
 [ 0.51143086 -0.70659728]
 [-4.96288343 -0.46968141]
 [-5.28904909 -0.60188372]
 [-5.29448305  0.77998415]]


Note that the obtained indices can be used with both the original data (data) and the PCA-transformed data (data2).
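
If the original table is kept in a pandas DataFrame, the same labels and indices let you pull out the matching rows directly. Below is a minimal sketch continuing from the variables defined above (data, cluster_labels); the column names are placeholders, not from the original post:

import pandas as pd
import numpy as np

# 'data' and 'cluster_labels' are the arrays produced by the code above;
# the column names here are only placeholders for illustration
df = pd.DataFrame(data, columns=['feature_1', 'feature_2', 'feature_3'])
df['cluster'] = cluster_labels  # attach each row's cluster label

# rows of the original table that fall in cluster 0
print(df[df['cluster'] == 0])

# or collect the row indices of every cluster in one dictionary
clusters = {c: np.where(cluster_labels == c)[0] for c in np.unique(cluster_labels)}
print(clusters)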


Thanks @Roy for the suggestion. I'll take a look.