Python: clustering the elements of a list based on a metric
I have a list of dictionaries, each holding a keyword and its vector distance, and I am trying to apply a clustering technique to group them.
# data = [{"key": "str1", "weight": float value}, ...]
# distances = [item['weight'] for item in data]
distances = [0.004906579754566209, 0.008361678408906337, 0.010228429212122636, 0.013671005756098031, 0.013671005756098031, 0.013713535105272179]
from statistics import mean  # assumption: the post does not show where mean comes from
mean_distances_differences = mean([j-i for i, j in zip(distances[:-1], distances[1:])])
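I computed the mean of the differences between consecutive elements of the list. If the gap between two neighbouring elements is smaller than that mean, I want to put them in the same cluster, so the result would be: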
[[0.004906579754566209], [0.008361678408906337], [0.010228429212122636], [0.013671005756098031, 0.013671005756098031, 0.013713535105272179]]
I don't think I can use knn here, because I don't know in advance how many clusters there will be. So I tried this:
distances = [item['weight'] for item in data]
mean_distances_differences = mean([j-i for i, j in zip(distances[:-1], distances[1:])])
distances_new = distances
required_list = []
while distances_new:
    temp = []
    if len(distances_new) == 1:
        temp = distances_new
        required_list.append(temp)
        break
    else:
        for i,j in zip(distances_new[:-1], distances_new[1:]):
            # as posted; `j-i` (the gap between neighbours) was probably intended
            if j-1 < mean_distances_differences:
                temp.append(i)
            else:
                break
    distances_new = [_i for _i in distances_new if _i not in temp]
    required_list.append(temp)
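Is there a way to do this?
You can use np.diff to compute the differences between consecutive distances. I take the absolute value because I am not sure the distances will be sorted: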
import numpy as np
distance_diff = abs(np.diff(distances))
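Taking the cumulative sum of whether each difference exceeds a threshold (here, the mean difference) groups together consecutive elements whose gap is below the threshold:
np.cumsum(distance_diff > abs(np.mean(distance_diff)))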
array([1, 2, 3, 3, 3])
So all that remains is to prepend a starting group 0:
np.hstack([0,np.cumsum(distance_diff > abs(np.mean(distance_diff)))])
array([0, 1, 2, 3, 3, 3])
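The labels above can be turned into the nested list asked for in the question. The following is a minimal sketch, not part of the original answer: the function name `group_by_gap` is hypothetical, and it assumes the input is already sorted and that the threshold is the mean absolute gap, as in the answer. It uses `np.split` at the positions where the gap exceeds the threshold.

```python
import numpy as np

def group_by_gap(values):
    """Split values into runs whose consecutive gaps do not exceed the mean gap.

    Hypothetical helper; assumes `values` is sorted in ascending order.
    """
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        # Zero or one element: nothing to compare against.
        return [values.tolist()] if values.size else []
    gaps = np.abs(np.diff(values))
    # A gap strictly larger than the mean gap starts a new group;
    # +1 converts a gap index into the index of the element after it.
    breaks = np.flatnonzero(gaps > gaps.mean()) + 1
    return [chunk.tolist() for chunk in np.split(values, breaks)]

distances = [0.004906579754566209, 0.008361678408906337, 0.010228429212122636,
             0.013671005756098031, 0.013671005756098031, 0.013713535105272179]
clusters = group_by_gap(distances)
print(clusters)
```

For the sample `distances` this should reproduce the grouping given in the question: three single-element groups followed by one group holding the last three values. If the input might be unsorted, sorting it first (for example with `sorted(distances)`) is probably what you want, since the gaps are only meaningful between neighbours in value order.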