Python: clustering images based on label co-occurrence


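I labeled a large number of object images with the Google Vision API. From those labels (a list stored in a pickle file), I built a label co-occurrence matrix (available as a NumPy array). The matrix is 2195x2195.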

Loading the data:

import pickle
import numpy as np
with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)

cooccurrence = np.load('cooccurrence.npy')

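I want to use cluster analysis to find a reasonable number of clusters (each defined as a list of vision labels) that represent some kind of object (e.g. cars, shoes, books). I don't know what the right number of clusters is.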

I tried the hierarchical clustering algorithm available in scikit-learn:

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 1000)

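# Create a non-symmetric "similarity" matrix: entry [i, j] is the co-occurrence
# of labels i and j divided by the number of occurrences of label i.
occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:,None]

# Clustering. Ward linkage always uses Euclidean distance, so no metric is passed
# (the `affinity` argument was renamed to `metric` in scikit-learn 1.2 and removed in 1.4).
from sklearn.cluster import AgglomerativeClustering
clusters = AgglomerativeClustering(n_clusters=200, linkage='ward').fit_predict(similarities)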

#results in pandas:
df_clusters = pd.DataFrame({'cluster': clusters.tolist(), 'label': labels})
df_clusters_grouped = df_clusters.groupby(['cluster']).agg({'label': [len, list]})
df_clusters_grouped.columns = [' '.join(col).strip() for col in df_clusters_grouped.columns.values]
df_clusters_grouped.rename(columns = {'label len': 'cluster_size', 'label list': 'cluster_labels'}, inplace=True)
df_clusters_grouped.sort_values(by=['cluster_size'], ascending=False)

This gives me 200 clusters, one of which looks like this:

["Racket", "Racquet sport", "Tennis racket", "Rackets", "Tennis", "Racketlon", "Tennis racket accessory", "Strings"]

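This works to some extent, but I would prefer a soft clustering method that can assign one label to several clusters (for example, "leather" could make sense for both shoes and wallets). In addition, I have to specify the number of clusters myself (200 in my example code), whereas I would rather get it as a result, if that is possible.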

I also experimented further, but did not get better output.

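Clustering methods such as sklearn's agglomerative clustering expect a data matrix as input. With metric="precomputed" they can also take a distance matrix (this does not hold for k-means and Gaussian mixture modeling, which really do need coordinate data).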
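However, what you have is a co-occurrence matrix, i.e. a similarity matrix. Its values have the opposite meaning of distances (large means similar), so you have to identify an appropriate transformation (for example, one based on occurrence rates). Treating the co-occurrence matrix as a data matrix and then using Euclidean distance, which is what you are doing, works to some extent, but the semantics are very odd and it is not recommended.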
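As one possible way to do such a transformation, the sketch below turns co-occurrence counts into a Jaccard-style distance and then cuts a hierarchical clustering at a distance threshold, so the number of clusters is not fixed in advance. The choice of Jaccard, the threshold of 0.7, the toy counts, and the helper name `cooccurrence_to_jaccard_distance` are all assumptions made for illustration, not something the answer prescribes. SciPy's hierarchical clustering is used here instead of scikit-learn purely for brevity. This still produces a hard clustering, not a soft one.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cooccurrence_to_jaccard_distance(cooc):
    """Hypothetical helper: 1 - c_ij / (c_ii + c_jj - c_ij)."""
    cooc = np.asarray(cooc, dtype=float)
    occ = cooc.diagonal()                       # per-label occurrence counts
    union = occ[:, None] + occ[None, :] - cooc  # images carrying label i or j
    with np.errstate(divide="ignore", invalid="ignore"):
        # Labels that never occur get similarity 0 instead of NaN.
        sim = np.where(union > 0, cooc / union, 0.0)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    return dist

# Invented counts: the diagonal holds occurrences, off-diagonal co-occurrences.
toy_cooccurrence = np.array([
    [10,  8,  1,  0],
    [ 8,  9,  0,  1],
    [ 1,  0, 12,  9],
    [ 0,  1,  9, 10],
])
toy_labels = ["Tennis", "Racket", "Shoe", "Footwear"]

dist = cooccurrence_to_jaccard_distance(toy_cooccurrence)
# linkage expects the condensed (upper-triangle) form of the distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
# Cut the tree where merges would exceed the (arbitrary) distance 0.7.
toy_assignments = fcluster(Z, t=0.7, criterion="distance")
for label, cluster_id in zip(toy_labels, toy_assignments):
    print(cluster_id, label)
```

On the real data, `toy_cooccurrence` would be replaced by the `cooccurrence` array loaded earlier, and the threshold would need tuning by inspecting the resulting label groups.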