Python: How do I find the indices of the data in each cluster after applying PCA and a clustering method?


I want to match each point in the plot to its corresponding index in the table.
Below are the first three rows of the data I am working with:

After applying PCA (2D) to my data, I get a plot similar to this one, and I then use the k-means algorithm for clustering.

Suppose I take all the data in the red cluster. How can I find the indices of those data points in the table?
Or, how can I find the index of each point in the table?
The goal is to be able to match each point in the plot to its corresponding row in the table. I am using Python.

Articles or documentation explaining this are welcome. Thanks in advance.

You can find the indices of the points in each cluster as follows.

Assume the data has 12 samples, each with three features. PCA is used to reduce the number of features, and then the data is clustered with k-means:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import decomposition

data=np.array([[5,4.5,6],[1,1,2],[1,2,1.5],[8,8.5,8],[2,2.5,1.5],[4,4.5,5],[8.5,8,9],[4,6,5.5],[5,6,5],[8,9,8.5],[9,9,8],[9,8,9]])

pca = decomposition.PCA(n_components=2) # apply pca
pca.fit(data)
data2 = pca.transform(data) # new data saved in data2
print("data with two features:\n", data2)
plt.plot(data2[:,0],data2[:,1],'ro')
for i in range(data2.shape[0]):        
        plt.text(data2[i,0],data2[i,1], str(i), fontsize=12)
plt.show()

#run Kmeans on data2 with 3 clusters
km=KMeans(n_clusters=3) # number of clusters =3 
km=km.fit(data2)
cluster_labels=km.labels_ # get cluster label of all data
print("cluster labels of points:", cluster_labels)

# get indexes of points in each cluster 
#Note: you can use these indexes in both data and data2
index_cluster_0=np.where(cluster_labels==0)[0] # get indexes of points in cluster 0 
print("indexes of points in cluster 0:", index_cluster_0)
index_cluster_1=np.where(cluster_labels==1)[0] # get indexes of points in cluster 1
print("indexes of points in cluster 1:", index_cluster_1)
index_cluster_2=np.where(cluster_labels==2)[0] # get indexes of points in cluster 2
print("indexes of points in cluster 2:", index_cluster_2)

#plot the results
plt.plot(data2[index_cluster_0,0],data2[index_cluster_0,1],'ro') #samples in cluster 0 are red
plt.plot(data2[index_cluster_1,0],data2[index_cluster_1,1],'bo') #samples in cluster 1 are blue
plt.plot(data2[index_cluster_2,0],data2[index_cluster_2,1],'go') #samples in cluster 2 are green
plt.title('Cluster 0: red, Cluster 1: blue, Cluster 2: green')
plt.show()
Output:

 data with two features:
 [[ 0.78528736  1.06750913]
 [ 7.42907267  0.7576433 ]
 [ 7.15249919 -0.32552917]
 [-4.40158551 -0.37613828]
 [ 6.26503561 -0.55127462]
 [ 1.95747537  0.30362524]
 [-4.9903965   0.69673761]
 [ 0.83759652 -0.57439496]
 [ 0.51143086 -0.70659728]
 [-4.96288343 -0.46968141]
 [-5.28904909 -0.60188372]
 [-5.29448305  0.77998415]]


Note that the obtained indices can be used with both the original data (data) and the PCA-transformed data (data2).
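
If the original table is kept in a pandas DataFrame, the same labels and indices let you pull out the matching rows directly. Below is a minimal sketch continuing from the variables defined above (data, cluster_labels); the column names are placeholders, not from the original post:

import pandas as pd
import numpy as np

# 'data' and 'cluster_labels' are the arrays produced by the code above;
# the column names here are only placeholders for illustration
df = pd.DataFrame(data, columns=['feature_1', 'feature_2', 'feature_3'])
df['cluster'] = cluster_labels  # attach each row's cluster label

# rows of the original table that fall in cluster 0
print(df[df['cluster'] == 0])

# or collect the row indices of every cluster in one dictionary
clusters = {c: np.where(cluster_labels == c)[0] for c in np.unique(cluster_labels)}
print(clusters)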


Thanks @Roy for the suggestion. I'll take a look.