Python: how to apply KMeans to a DataFrame with multiple features to get the centroids (tags: python, dataframe, k-means, euclidean-distance)


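I am following a detailed KMeans tutorial that uses a dataset with 2 features.

However, I have a DataFrame with 5 features (columns), so instead of using the tutorial's

def euclidean_distance(x1, x2):

function, I compute the Euclidean distance as follows: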

def euclidean_distance(df):
    n = df.shape[1]
    distance_matrix = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            distance_matrix[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))
    return distance_matrix

Next, I want to implement the part of the tutorial that works out the centroids, shown below:

def _closest_centroid(self, sample, centroids):
    distances = [euclidean_distance(sample, point) for point in centroids]

Since my

def euclidean_distance(df):

function only accepts one argument, df, how can I best implement this step to get the centroids?

My sample dataset df looks like this:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,-2.14
0.52,0.44,0.19,0.29,30.44
1.27,1.15,1.32,0.60,-161.63
0.88,0.79,0.63,0.58,-49.52
1.39,1.15,1.32,0.41,-188.52
0.86,0.80,0.65,0.65,-45.27

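Added: plot() function

The plot function you included raises TypeError: object of type 'itertools.combinations' has no len(), which I fixed by changing
len(combinations)
to
len(list(combinations))
However, the output is not a scatter plot. Do you know what I need to fix here?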
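Reading the data and clustering it should not throw any errors, even if you increase the number of features in the dataset. In fact, you only get an error in that part of the code because you redefined the Euclidean distance function.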
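To illustrate the point, here is a minimal sketch of a two-argument distance function that works for any number of features, together with a nearest-centroid lookup. It is written in the spirit of the tutorial's euclidean_distance(x1, x2) and _closest_centroid, but the function bodies below are assumptions, not the tutorial's code. The rows come from the sample dataset in the question; closest_centroid is a hypothetical helper, and picking rows 0 and 4 as centroids is arbitrary.

```python
import numpy as np

def euclidean_distance(x1, x2):
    """Distance between two 1-D arrays of equal length (any number of features)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))

def closest_centroid(sample, centroids):
    """Index of the centroid nearest to `sample` (hypothetical helper)."""
    distances = [euclidean_distance(sample, point) for point in centroids]
    return int(np.argmin(distances))

# Rows from the question's df: 6 samples, 5 features each
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

# Assumption for illustration only: use rows 0 and 4 as the centroids
centroids = X[[0, 4]]
labels = [closest_centroid(sample, centroids) for sample in X]
print(labels)
```

Nothing in either function depends on the number of columns, so the same code handles 2 features or 5. Each sample and each centroid is one row of the array, not one column.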
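This answer addresses the actual error you are getting, which comes from the plot function.
It takes all the points in a given cluster and tries to draw a scatter plot.

The asterisk in
ax.scatter(*point)
means that point is unpacked.
The implicit assumption here (which is why it is hard to spot) is that point is 2-dimensional; its individual components are then interpreted as the x and y values to plot.
But because you have 5 features, point is 5-dimensional.
Take a look at the signature of matplotlib.axes.Axes.scatter:

Axes.scatter(self, x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=<deprecated parameter>, edgecolors=None, *, plotnonfinite=False, data=None, **kwargs)

x
y
s (i.e. the marker size)
c (i.e. the color)
marker (i.e. the marker style)

The first four, i.e. x, y, s and c, accept floats, but your dataset is 5-dimensional, so the fifth feature is interpreted as marker, which expects a MarkerStyle. Since it gets floats instead, it throws the error.
What to do: only look at 2 or 3 dimensions at a time, or use dimensionality reduction (e.g. principal component analysis) to project the data into a lower-dimensional space.
For the first option, you can redefine the plot method in the KMeans class: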

def plot(self):
    import itertools

    # Generate all pairs of feature indices. The pairs are taken over the
    # number of features (columns of self.X), not the number of clusters,
    # and stored in a list so that len() works and the pairs can be reused.
    n_features = self.X.shape[1]
    combinations = list(itertools.combinations(range(n_features), 2))

    # Initialise one subplot per feature pair; squeeze=False keeps `axes`
    # an array even when there is only a single pair.
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1, squeeze=False)

    # Loop through the feature pairs and their subplots
    for (x, y), ax in zip(combinations, axes.ravel()):
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)

        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)

        ax.set_title('feature {} vs feature {}'.format(x, y))

    plt.show()
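For the second option (dimensionality reduction), the following is a minimal sketch of a PCA projection to 2 dimensions. It is computed with NumPy's SVD on the centered data rather than with a dedicated PCA library. pca_project is a hypothetical helper name, and the rows are the sample data from the question.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X (n_samples, n_features) onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    # PCA works on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Rows from the question's df: 6 samples, 5 features each
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

X_2d = pca_project(X, n_components=2)
print(X_2d.shape)
```

Each row of the projected array is a 2-dimensional point, so it can be clustered and drawn with a plotting routine that assumes two coordinates per point.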

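The Euclidean distance function in the tutorial is defined for arrays, so the dimensionality of the space does not matter. This means you do not need to write your own function.
The function in the tutorial works for two arrays with any number of features (the dimensions I referred to in my earlier comment). It infers the number of features from the shape of the dataset. Where do you get an error when running the tutorial code?
Line 81, in _closest_centroid
distances = [euclidean_distance(sample, point) for point in centroids]
TypeError: euclidean_distance() takes exactly 1 argument (2 given).
The function in the repo does not throw that error, because it takes 2 arguments. Are you sure you are using that one?
When I use the tutorial function, line 17 of kmeans_test.py, y_pred = k.predict(X), throws "ValueError: Unrecognized marker style [13.15717]", pointing to lines 35 and 93 of kmeans.py. As mentioned, I replaced the make_blobs() call to fit a DataFrame with 31 rows and 5 columns (features), as follows. Otherwise the tutorial code runs fine without any modification.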
data = pd.read_csv('df.csv')
X = np.array(data)
print(X.shape)
clusters = 5
k = KMeans(K=clusters, max_iters=150, plot_steps=True)
y_pred = k.predict(X)
Thank you very much @warped. I can now cluster with the original tutorial using 2 dimensions (however, the plot(self) function above runs but only prints black rectangles, not the clusters). I also tried PCA to reduce the 5 features to 2 dimensions, and was able to get the clustering to work.
@Gee you can use ax.plot instead of ax.scatter.
@Gee could you edit this into your question, or ask a new question? Reading this in the comments is a bit tedious.
Do you know how to use the plot() method suggested above to plot the clusters? I have tried several times, but it does not plot the clusters.
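One likely explanation for the empty axes, assuming the workaround was written as nrows=len(list(combinations)) without keeping the list: itertools.combinations returns a one-shot iterator, so counting it with list() consumes it, and the later zip(combinations, axes.ravel()) has nothing left to loop over. The sketch below shows the effect outside the KMeans class; the variable names are illustrative only.

```python
import itertools

combinations = itertools.combinations(range(5), 2)
n_plots = len(list(combinations))   # counting the pairs consumes the iterator
leftover = list(combinations)       # nothing is left for a later loop

print(n_plots)
print(leftover)

# Materialising the pairs once keeps them available for both len() and zip()
pairs = list(itertools.combinations(range(5), 2))
print(len(pairs), pairs[:3])
```

With the pairs stored in a list, the same object can size the subplot grid and drive the plotting loop.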