
Clustering with missing data in scikit-learn (Python)

Tags: python, scikit-learn, cluster-analysis, missing-data

I want to cluster data with missing columns. When doing it manually, I would calculate the distance without that column whenever it is missing.

With scikit-learn, missing data is not possible; there is also no way to specify a user-defined distance function.

Is there any chance to cluster with missing data?

Example data:

import numpy as np
from sklearn.datasets import make_swiss_roll

n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)

# knock out roughly 10% of the entries at random
rnd = np.random.rand(X.shape[0], X.shape[1])
X[rnd < 0.1] = np.nan
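For illustration, the manual "distance without the missing column" could look like this (a minimal sketch; the function name is mine, not a scikit-learn API):

def distance_skipping_missing(u, v):
    # Euclidean distance computed only over the columns that are
    # observed in both points; NaN columns are simply left out
    mask = np.isfinite(u) & np.isfinite(v)
    return np.sqrt(np.sum((u[mask] - v[mask]) ** 2))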
I think you can use an iterative EM-type algorithm:

Initialize the missing values to their column means.

Repeat until convergence:

  • Perform K-means clustering on the filled-in data

  • Set the missing values to the centroid coordinates of the clusters they were assigned to

Implementation, with an example on fake data (a sketch is given below):
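A minimal sketch of this procedure, written to match how kmeans_missing(Xm, n_clusters=5) is called in the benchmark at the bottom of this page (the warm-start details and the max_iter default are my assumptions):

import numpy as np
from sklearn.cluster import KMeans

def kmeans_missing(X, n_clusters, max_iter=10):
    """K-means for data with NaNs: alternate between imputing the missing
    values and re-clustering, as described above. Returns the cluster
    labels, the centroids, and the imputed data."""
    missing = ~np.isfinite(X)
    mu = np.nanmean(X, axis=0, keepdims=True)
    X_hat = np.where(missing, mu, X)  # step 1: fill with column means

    prev_labels = None
    for i in range(max_iter):
        if prev_labels is None:
            cls = KMeans(n_clusters)
        else:
            # warm-start from the previous centroids
            cls = KMeans(n_clusters, init=centroids, n_init=1)
        labels = cls.fit_predict(X_hat)
        centroids = cls.cluster_centers_
        # overwrite each missing entry with the coordinate of the
        # centroid of the cluster its point was assigned to
        X_hat[missing] = centroids[labels][missing]
        if prev_labels is not None and np.all(labels == prev_labels):
            break  # assignments stable, so we have converged
        prev_labels = labels

    return labels, centroids, X_hat

Warm-starting each round from the previous centroids keeps the successive K-means runs cheap; only the first round pays for the full k-means++ initialization.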

Update:

In fact, after a quick Google search, I found that the algorithm above is essentially the same as the k-POD algorithm for K-means clustering of missing data.

Here is a different algorithm that I use. Instead of replacing the missing values, they are ignored, and in order to capture the differences between missing and non-missing entries I add missing-value dummies.

Compared to Ali's algorithm, observations with missing values seem to jump from class to class more readily, since I do not fill in the missing values.

Unfortunately, I have not had the time to compare it against Ali's nice code, but feel free to do so (I might do it when I get the time) and to contribute to the discussion of the best method.

import numpy as np

class kmeans_missing(object):
    def __init__(self, potential_centroids, n_clusters):
        # initialize with potential centroids
        self.n_clusters = n_clusters
        self.potential_centroids = potential_centroids

    def fit(self, data, max_iter=10, number_of_runs=1):
        n_clusters = self.n_clusters
        potential_centroids = self.potential_centroids

        dist_mat = np.zeros((data.shape[0], n_clusters))
        all_centroids = np.zeros((n_clusters, data.shape[1], number_of_runs))

        costs = np.zeros((number_of_runs,))
        for k in range(number_of_runs):
            # pick the initial centroids at random from the candidate rows
            idx = np.random.choice(range(potential_centroids.shape[0]), size=(n_clusters,), replace=False)
            centroids = potential_centroids[idx]
            clusters = np.zeros(data.shape[0])
            old_clusters = np.zeros(data.shape[0])
            for i in range(max_iter):
                # squared distance to each centroid, ignoring NaN components
                for j in range(n_clusters):
                    dist_mat[:, j] = np.nansum((data - centroids[j])**2, axis=1)
                # assign each observation to its nearest centroid
                clusters = np.argmin(dist_mat, axis=1)
                # update the centroids, again ignoring NaN components
                for j in range(n_clusters):
                    centroids[j] = np.nanmean(data[clusters == j], axis=0)
                if all(np.equal(clusters, old_clusters)):
                    break  # break when there is no change in the cluster assignments
                if i == max_iter - 1:
                    print('no convergence before maximal iterations are reached')
                else:
                    clusters, old_clusters = old_clusters, clusters

            all_centroids[:, :, k] = centroids
            costs[k] = np.mean(np.min(dist_mat, axis=1))
        self.costs = costs
        self.cost = np.min(costs)
        self.best_model = np.argmin(costs)
        self.centroids = all_centroids[:, :, self.best_model]
        self.all_centroids = all_centroids

    def predict(self, data):
        dist_mat = np.zeros((data.shape[0], self.n_clusters))
        for j in range(self.n_clusters):
            dist_mat[:, j] = np.nansum((data - self.centroids[j])**2, axis=1)
        prediction = np.argmin(dist_mat, axis=1)
        cost = np.min(dist_mat, axis=1)
        return prediction, cost
Here is an example of how to use it:

import numpy as np
from sklearn.datasets import make_blobs
from kmeans_missing import *  # the class defined above, saved as kmeans_missing.py

def make_fake_data(fraction_missing, n_clusters=5, n_samples=1500,
                   n_features=2, seed=None):
    # complete data
    gen = np.random.RandomState(seed)
    X, true_labels = make_blobs(n_samples=n_samples, n_features=n_features,
                                centers=n_clusters, random_state=gen)
    # knock out a random fraction of the values
    missing = gen.rand(*X.shape) < fraction_missing
    Xm = np.where(missing, np.nan, X)
    return X, true_labels, Xm

X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0)
X_missing_dummies = np.isnan(X_hat)
n_clusters = 3
X_hat = np.concatenate((X_hat, X_missing_dummies), axis=1)
kmeans_m = kmeans_missing(X_hat, n_clusters)
kmeans_m.fit(X_hat, max_iter=100, number_of_runs=10)
print(kmeans_m.costs)
prediction, cost = kmeans_m.predict(X_hat)

# crude confusion matrix: overlap of each predicted cluster with each true label
for i in range(n_clusters):
    print([np.mean((prediction == i) * (true_labels == j)) for j in range(3)], np.mean(prediction == i))
--EDIT--


In this example, the occurrence of missing values is completely random, and in that case it is better not to add the missing-value dummies, since they are then pure noise. Not including them is also the right thing to do when comparing against Ali's algorithm.
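As a minimal variation of the usage example above, the fit without the dummy columns would then be:

# Same as the usage example above, but without concatenating the
# missing-value dummy columns (they are pure noise when values are
# missing completely at random).
X, true_labels, X_hat = make_fake_data(fraction_missing=0.3, n_clusters=3, seed=0)
kmeans_m = kmeans_missing(X_hat, n_clusters=3)  # the class defined above
kmeans_m.fit(X_hat, max_iter=100, number_of_runs=10)
prediction, cost = kmeans_m.predict(X_hat)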

Comments:

• I think you can, by assigning them a specific value; usually the median or the mean is taken as the replacement. This may seem strange, but it is actually quite standard.

• Does that seem like an acceptable solution? I would rather avoid assigning e.g. the global mean, since that might corrupt the proper class assignment. In fact, I hope to use the clustering for imputation, i.e. to assign the cluster mean to the missing values instead of the global mean.

• How would you calculate the distance for missing values? A missing value could be anything, so the distance might be far off. You should impute missing values via the mean or via their correlation with other variables.

• Hmm... good question. I am thinking of computing a kind of normalized Gaussian distance, i.e. (the sum of the componentwise absolute distances) divided by (the number of components summed). This could be done over all columns, or only over the available ones. Is that a bad idea? It reminds me of a naive Bayes classifier, where one can also "skip" missing columns.

• OK, this seems pretty close to what I had (vaguely) in mind. Thank you, I will try this. And thanks for the hint about the k-POD algorithm.

• Is there a reason the two groups flip colors between the plots, or is that by chance?

• @zelite The colors are determined by the cluster labels, which are assigned in an arbitrary order. It would actually probably be clearer to use the same set of labels for the original and the imputed data. I may change that if I get time later today.

• @Cupitor, that would arguably be cheating :-). If I colored the imputed points according to the true labels, the colors of the points within each blob would be guaranteed to be uniform. Besides, since the labels of the inferred clusters are randomly initialized, the mapping between the "true" and the imputed cluster labels is arbitrary; for example, the top cluster might have label 3 in the original data but label 1 in the imputed data. This would shuffle the blob colors at random, which would make the figure harder to interpret.

• @Cupitor 1) Yes, KMeans processes the cluster initializations in small batches; if we explicitly set the initial cluster centroids, the n_jobs parameter has no effect. 2) I guess you are probably just running out of memory. I would have to dig into sklearn's source to be sure, but most k-means implementations use O(n·k·d) memory, where n is the number of samples, k is the number of clusters to find, and d is the dimensionality of the feature space, so the memory requirement grows multiplicatively with the number of features.
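For concreteness, the normalized distance proposed in the comments might be sketched as follows (the helper name is made up):

import numpy as np

def masked_normalized_distance(u, v):
    # compare only the coordinates that are observed in both vectors
    mask = np.isfinite(u) & np.isfinite(v)
    if not mask.any():
        return np.nan  # no shared coordinates to compare
    # sum of componentwise absolute distances, divided by how many
    # components actually entered the sum
    return np.abs(u[mask] - v[mask]).sum() / mask.sum()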
To quantify how the clustering quality degrades with missing data, the adjusted mutual information between the inferred and the true labels can be tracked as the fraction of missing values grows, both over all points and over only the points that have missing values. This uses make_fake_data from above and the kmeans_missing function sketched earlier (not the class of the same name):

import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import adjusted_mutual_info_score

fraction = np.arange(0.0, 1.0, 0.05)
n_repeat = 10
scores = np.empty((2, fraction.shape[0], n_repeat))
for i, frac in enumerate(fraction):
    for j in range(n_repeat):
        X, true_labels, Xm = make_fake_data(fraction_missing=frac, n_clusters=5)
        labels, centroids, X_hat = kmeans_missing(Xm, n_clusters=5)
        any_missing = np.any(~np.isfinite(Xm), 1)
        scores[0, i, j] = adjusted_mutual_info_score(labels, true_labels)
        scores[1, i, j] = adjusted_mutual_info_score(labels[any_missing],
                                                     true_labels[any_missing])

fig, ax = plt.subplots(1, 1)
scores_all, scores_missing = scores
ax.errorbar(fraction * 100, scores_all.mean(-1),
            yerr=scores_all.std(-1), label='All labels')
ax.errorbar(fraction * 100, scores_missing.mean(-1),
            yerr=scores_missing.std(-1),
            label='Labels with missing values')
ax.set_xlabel('% missing values')
ax.set_ylabel('Adjusted mutual information')
ax.legend(loc='best', frameon=False)
ax.set_ylim(0, 1)
ax.set_xlim(-5, 100)
plt.show()