Python unsupervised learning: a DBSCAN scikit-learn application example


I have the following list that I want to learn from in an unsupervised way, and then use that knowledge to predict a value for each item in a test list.

#Format [real_runtime, processors, requested_time, score, more_to_be_added]
#some entries from the list
There is a training data set and a test data set (both shown in the code below). Using clustering, I want to predict the real_runtime of each entry in the new list, Xtest.

Code: the Python code below formats the lists with scikit-learn, generates the clusters, and plots them. Any idea how I can use the clustering to predict values?

Clustering is not prediction. "Predicting" a cluster label is rarely useful, because the label is assigned essentially arbitrarily by the clustering algorithm.

Worse: most clustering algorithms cannot incorporate new data at all.

You should really use clustering to explore your data and learn what is there and what is not. Do not rely on the clustering being "good".


Sometimes people successfully quantize a data set to k centers and then use only this "compressed" data set for classification/prediction (usually nearest-neighbor based). I have also seen the idea of training one regression per cluster and using nearest neighbors to select which regressor to apply (i.e., if the data clusters well, use a per-cluster regression model). But I cannot recall any major success stories...
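For illustration, the per-cluster-regression idea above could be sketched roughly as follows. This is not code from the question: the synthetic two-regime data, the choice of KMeans, k=2, and plain LinearRegression are all arbitrary assumptions made for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: two well-separated regimes with different linear relationships
X1 = rng.uniform(0, 10, size=(50, 1));  y1 = 3 * X1[:, 0] + 5
X2 = rng.uniform(20, 30, size=(50, 1)); y2 = -2 * X2[:, 0] + 100
X = np.vstack([X1, X2])
y = np.concatenate([y1, y2])

# Quantize the data to k centers...
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# ...then train one regressor per cluster
regs = {c: LinearRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
        for c in range(k)}

def predict(x):
    """Pick the regressor belonging to the nearest cluster center."""
    c = km.predict(np.atleast_2d(x))[0]
    return regs[c].predict(np.atleast_2d(x))[0]

print(predict([5.0]))   # should be close to 3*5 + 5 = 20
print(predict([25.0]))  # should be close to -2*25 + 100 = 50
```

Note that KMeans is used here instead of DBSCAN precisely because it has a `predict` method for new points; DBSCAN does not.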

Why cluster at all, rather than do a plain multivariate regression? As Prune already suggested, clustering first makes no sense. The only reason to do it would be if you already knew how you were going to use it afterwards and that it would not work without it. In general, this is not how to approach a problem: start with the simplest solution, and look for something more complex only if that fails.

The original idea was to learn from a few hundred records and use that knowledge to predict one attribute of the next record. I considered clustering because a new record (to predict) will be similar to some of the records learned (processed) so far, not to all of them.
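The "simplest solution" suggested above would be a plain regression on the question's own data: treat real_runtime (the first column) as the target and the remaining columns as features. A minimal sketch, assuming LinearRegression is an acceptable baseline (it is not from the question):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training rows from the question: [real_runtime, processors, requested_time, score]
Xsrc = [[354, 2048, 3600, 53.0521472395],
        [605, 2048, 600, 54.8768871369],
        [128, 2048, 600, 51.0],
        [136, 2048, 900, 51.0000000563],
        [19218, 480, 21600, 51.0],
        [15884, 2048, 18000, 51.0],
        [118, 2048, 1500, 51.0],
        [103, 2048, 2100, 51.0000002839],
        [18542, 480, 21600, 51.0000000001],
        [13272, 2048, 18000, 51.0000000001]]

data = np.array(Xsrc)
y = data[:, 0]   # real_runtime is the prediction target
X = data[:, 1:]  # processors, requested_time, score are the features

model = LinearRegression().fit(X, y)

# Predict real_runtime for one of the test rows (processors, requested_time, score)
print(model.predict([[2048, 10800, 51.0000000002]]))
```

With only ten training rows the fit will be crude, but it gives a baseline to beat before reaching for anything cluster-based.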
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

##Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'], 
      ['605', '2048', '600', '54.8768871369'], 
      ['128', '2048', '600', '51.0'], 
      ['136', '2048', '900', '51.0000000563'], 
      ['19218', '480', '21600', '51.0'], 
      ['15884', '2048', '18000', '51.0'], 
      ['118', '2048', '1500', '51.0'], 
      ['103', '2048', '2100', '51.0000002839'], 
      ['18542', '480', '21600', '51.0000000001'], 
      ['13272', '2048', '18000', '51.0000000001']]

print("Xsrc:", Xsrc)

##Test data set
Xtest= [['1224', '2048', '1500', '51.0000000161'],
       ['7867', '2048', '10800', '51.0000000002'],
       ['21594', '512', '21600', '-1'], 
       ['1760', '512', '2700', '51.0000000004'],
       ['115', '1024', '21600', '51.1042617556']]


##Clustering 
X = StandardScaler().fit_transform(np.array(Xsrc, dtype=float))  #convert strings to floats before scaling
db = DBSCAN(min_samples=2).fit(X) #eps left at its default (0.5); it is DBSCAN's key tuning parameter
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
clusters = [X[labels == i] for i in range(n_clusters_)]

print('Estimated number of clusters: %d' % n_clusters_)
if n_clusters_ > 1:  #silhouette is only defined for 2+ labels
    print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))


##Plotting the dataset
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=20)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)


plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()