Python: clusters do not look right


I can't figure out what is wrong with my code: my plot does not show all 4 clusters. Any ideas?

kmeans = KMeans(n_clusters=4)
kmeans.fit(x)
y_kmeans = kmeans.fit_predict(x)
print(kmeans.cluster_centers_)
print(kmeans.labels_)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1], s=100, c='magenta', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='yellow', label='Centroids')
plt.title('Customer clusters')
plt.xlabel('Score')
plt.ylabel('')
plt.show()

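Is your data continuous or categorical? It looks categorical. Computing distances between binary variables does not make much sense. Not all data is suitable for clustering.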
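One way to check whether categorical scores are the reason a plot looks sparse is to count how many distinct positions the plotted columns can take. The following is a minimal sketch, assuming a hypothetical array `x` of 200 rows and 4 integer scores from 1 to 4 as a stand-in for the asker's data, which is not shown in the question:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the asker's data: 200 rows, 4 variables, integer scores 1-4
x = rng.integers(1, 5, size=(200, 4))

# The question plots only columns 0 and 1, so count the distinct (col0, col1) positions
plotted_positions = np.unique(x[:, :2], axis=0)
print(len(x), "rows ->", len(plotted_positions), "distinct plotted positions")
```

With two plotted columns that each take only 4 values, there can be at most 16 distinct positions, so most markers land exactly on top of each other and later `plt.scatter` calls can hide earlier ones.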

I don't have your actual data, but I'll show you how clustering can go right or wrong using the canonical mtcars sample data.

# import mtcars data from web, and do some clustering on the data set


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans 


# Import CSV mtcars
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
# Edit element of column header
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)


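# Features: every column after the name column except the last one.
# The last column (carb) is the target, so it must not also appear in X1.
X1 = data.iloc[:, 1:-1]
Y1 = data.iloc[:, -1]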

# Fit a decision tree to estimate feature importance
from sklearn.tree import DecisionTreeClassifier
tree= DecisionTreeClassifier(criterion='entropy', random_state=1)
tree.fit(X1, Y1)



imp = pd.DataFrame(index=X1.columns, data=tree.feature_importances_, columns=['Imp'])
# sort_values returns a new DataFrame, so keep the result
imp = imp.sort_values(by='Imp', ascending=False)

sns.barplot(x=imp.index.tolist(), y=imp.values.ravel(), palette='coolwarm')
plt.show()

X=data[['cyl','drat']]
Y=data['carb']

# Create segments using k-means clustering
from sklearn.cluster import KMeans
# Use the elbow method to choose the number of clusters
wcss=[]
for i in range(1,7):
    kmeans= KMeans(n_clusters=i, init='k-means++', random_state=1)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)


plt.plot(range(1,7), wcss, linestyle='--', marker='o', label='WCSS value')
plt.title('WCSS value- Elbow method')
plt.xlabel('no of clusters- K value')
plt.ylabel('Wcss value')
plt.legend()
plt.show()

# kmeans is the last model from the loop (n_clusters=6); its predictions are not used below
kmeans.predict(X)


#Cluster Center
kmeans = MiniBatchKMeans(n_clusters = 5)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)


colors = ["green", "red", "blue", "yellow", "orange"]
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=np.array(colors)[labels], s = 10, alpha=.1)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10, c=colors)
plt.show()

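As you can see, the features you choose for clustering have a huge impact on the result (obviously). The first example looks somewhat like your result; the second looks like a more useful and interesting clustering experiment.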
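The effect of feature choice can also be put into a number rather than judged by eye. The sketch below is not part of the original answer: it uses synthetic data instead of mtcars and the silhouette score from scikit-learn as one possible measure, and the helper `kmeans_silhouette` is a hypothetical name introduced only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption, not mtcars): two informative features forming 3 blobs
informative, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                            cluster_std=0.5, random_state=0)
# Two uninformative features: uniform noise with no cluster structure
rng = np.random.default_rng(0)
noise = rng.uniform(0, 5, size=(300, 2))


def kmeans_silhouette(features, k=3):
    """Fit k-means on the given features and return the silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return silhouette_score(features, labels)


score_informative = kmeans_silhouette(informative)
score_noise = kmeans_silhouette(noise)
print(f"informative features: {score_informative:.2f}")
print(f"noise features:       {score_noise:.2f}")
```

A score closer to 1 suggests compact, well-separated clusters; the same comparison could be run on two column subsets of a real data set.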

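It is categorical, since each variable is a score from 1 to 4 rather than a continuous variable. The project requires clustering on 4 variables.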
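When the model is fitted on 4 variables but the plot shows only columns 0 and 1, clusters that differ mainly in the other two variables can overlap on screen. Two checks that may help are counting the rows per cluster and projecting all 4 variables onto 2 axes before plotting. This is a rough sketch under assumptions: `x` is a hypothetical random stand-in for the asker's data, and PCA is one possible projection, not something the question or answer uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the asker's data: 200 rows, 4 variables, integer scores 1-4
x = rng.integers(1, 5, size=(200, 4)).astype(float)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(x)

# Number of rows assigned to each cluster (index = cluster label)
sizes = np.bincount(labels, minlength=4)
print("cluster sizes:", sizes)

# Project all 4 variables onto 2 axes so the plot reflects every clustered variable
pca = PCA(n_components=2)
x_2d = pca.fit_transform(x)
centers_2d = pca.transform(kmeans.cluster_centers_)
print(x_2d.shape, centers_2d.shape)
```

The arrays `x_2d` and `centers_2d` can then be passed to `plt.scatter` in place of the raw columns, for example `plt.scatter(x_2d[labels == 0, 0], x_2d[labels == 0, 1])`.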

...now, I am just changing two features (two independent variables)...and re-running the same experiment...

X=data[['wt','qsec']]
Y=data['carb']

# Create segments using k-means clustering
from sklearn.cluster import KMeans
# Use the elbow method to choose the number of clusters
wcss=[]
for i in range(1,7):
    kmeans= KMeans(n_clusters=i, init='k-means++', random_state=1)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)


plt.plot(range(1,7), wcss, linestyle='--', marker='o', label='WCSS value')
plt.title('WCSS value- Elbow method')
plt.xlabel('no of clusters- K value')
plt.ylabel('Wcss value')
plt.legend()
plt.show()

# kmeans is the last model from the loop (n_clusters=6); its predictions are not used below
kmeans.predict(X)


#Cluster Center
kmeans = MiniBatchKMeans(n_clusters = 5)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)


colors = ["green", "red", "blue", "yellow", "orange"]
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=np.array(colors)[labels], s = 10, alpha=.1)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10, c=colors)
plt.show()