scikit-learn MiniBatchKMeans: sparse vs. dense matrix clustering

optimization scikit-learn

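I just wrote some code whose behavior I have trouble understanding, and any help would be greatly appreciated. The question is: why does clustering a sparse matrix take so much more time and memory, and behave differently, than clustering the same matrix in dense format?

Here is the code. For both a dense matrix and a sparse matrix, it simply does the following:

Create a 100K x 500 matrix

Fit a MiniBatchKMeans estimator on the matrix (we don't care about the result)

Display the time it took to fit the estimator

Between the two benchmarks, memory is manually garbage-collected (to make sure we start fresh).

After running this code several times (to make sure the randomness of the KMeans algorithm was not the cause of what I found), I had several surprises: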

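Why does the clustering algorithm need about 40 times more iterations to converge when using the dense representation of the matrix?
Why does it take twice as long to converge with the sparse representation as with the dense one, despite performing far fewer iterations?
Finally, I guess more memory is allocated in the sparse version of the benchmark because the matrix (randomly generated) contains no zeros, which makes the sparse format less memory efficient. Am I right?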
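The memory intuition in the last point can be checked with some back-of-the-envelope arithmetic. The sketch below is not part of the original benchmark. It assumes float64 values (8 bytes) and 32-bit CSR indices (4 bytes), and the helper names `dense_nbytes` and `csr_nbytes` are made up for illustration. It only counts the array buffers, so the process memory reported by a profiler can differ.

```python
def dense_nbytes(n_rows, n_cols, value_bytes=8):
    """Bytes needed for a dense 2-D array (assumed float64 by default)."""
    return n_rows * n_cols * value_bytes


def csr_nbytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Bytes needed for the three CSR arrays (assumed 32-bit indices)."""
    data = nnz * value_bytes             # one stored value per non-zero
    indices = nnz * index_bytes          # one column index per non-zero
    indptr = (n_rows + 1) * index_bytes  # one row pointer per row, plus one
    return data + indices + indptr


n_rows, n_cols = 100_000, 500
nnz = n_rows * n_cols  # a matrix of random floats has (almost surely) no zeros

dense = dense_nbytes(n_rows, n_cols)
sparse = csr_nbytes(n_rows, nnz)
print("dense: %.1f MiB" % (dense / 2**20))
print("csr:   %.1f MiB" % (sparse / 2**20))
print("ratio: %.2f" % (sparse / dense))
```

Under these assumptions a fully populated CSR matrix needs about 1.5 times the memory of the dense array, and CSR only becomes smaller once fewer than roughly two thirds of the entries are non-zero.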

Here is the output of the benchmark:

>>>>> Dense Matrix Clustering
Init 1/10 with method: k-means++
Inertia for init 1/10: 11546.570096
[...]
Init 10/10 with method: k-means++
Inertia for init 10/10: 11554.093346
Minibatch iteration 1/100000: mean batch inertia: 42.160602, ewa inertia: 42.160602 
Minibatch iteration 2/100000: mean batch inertia: 41.914472, ewa inertia: 42.160110 
[...]
Minibatch iteration 977/100000: mean batch inertia: 41.750966, ewa inertia: 41.581670 
Minibatch iteration 978/100000: mean batch inertia: 41.719181, ewa inertia: 41.581945 
Converged (lack of improvement in inertia) at iteration 978/100000
Computing label assignment and total inertia
Clustered dense matrix in: 7.363s
Filename: experiments/dense_sparse_bench.py

Line #    Mem usage    Increment   Line Contents
================================================
    13     33.2 MiB      0.0 MiB   @profile
    14                             def bench_dense():
    15                                 # create a random dense matrix
    16     33.2 MiB      0.0 MiB       dense_matrix = np.random.random((
    17                                     100000,  # 100K 'fake' documents
    18    241.2 MiB    208.0 MiB           500  # 500 dimensions
    19                                 ))
    20    241.3 MiB      0.1 MiB       s = time.time()
    21    241.3 MiB      0.0 MiB       km = MiniBatchKMeans(
    22    241.4 MiB      0.2 MiB           n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
    23    405.0 MiB    163.6 MiB       km.fit_predict(dense_matrix)  # cluster the points
    24    405.0 MiB      0.0 MiB       print "Clustered dense matrix in: %.3fs" % (time.time() - s)

>>>>> Sparse Matrix Clustering
Init 1/10 with method: k-means++
Inertia for init 1/10: 11618.817774
[...]
Init 10/10 with method: k-means++
Inertia for init 10/10: 11609.579624
Minibatch iteration 1/100000: mean batch inertia: 42.105951, ewa inertia: 42.105951 
Minibatch iteration 2/100000: mean batch inertia: 42.375899, ewa inertia: 42.106491 
[...]
Minibatch iteration 21/100000: mean batch inertia: 41.912611, ewa inertia: 42.258551 
Minibatch iteration 22/100000: mean batch inertia: 41.662418, ewa inertia: 42.257358 
Converged (lack of improvement in inertia) at iteration 22/100000
Computing label assignment and total inertia
Clustered sparse matrix in: 14.243s
Filename: experiments/dense_sparse_bench.py

Line #    Mem usage    Increment   Line Contents
================================================
    27     38.5 MiB      0.0 MiB   @profile
    28                             def bench_sparse():
    29                                 # convert the dense matrix in sparse format
    30     38.5 MiB      0.0 MiB       sparse_matrix = csr_matrix(np.random.random((
    31                                     100000,  # 100K 'fake' documents
    32    271.0 MiB    232.5 MiB           500  # 500 dimensions
    33                                 )))
    34    271.1 MiB      0.1 MiB       s = time.time()
    35    271.1 MiB      0.0 MiB       km = MiniBatchKMeans(
    36    271.2 MiB      0.1 MiB           n_clusters=20, init='k-means++', batch_size=100, n_init=10, verbose=1)
    37    598.5 MiB    327.3 MiB       km.fit_predict(sparse_matrix)
    38    598.5 MiB      0.0 MiB       print "Clustered sparse matrix in: %.3fs" % (time.time() - s)

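Thanks in advance.

You... found a bug.

Can you comment with your platform, sklearn version, etc., so that I can report this to the sklearn developers? This is a bug.

I tightened up your script (setting random_state in the MiniBatchKMeans constructor to ensure the "same" results) and then started digging. Your results diverge after the first cluster reassignment. So I modified the function in k_means_.py to spit out some variables. There, I added the following print statements in the "if n_reassign" block:

Then I changed verbose to 0 and got the following output: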

>>>>> Dense Matrix Clustering
b
to_reassign [False False False False False False False False False False  True False
 False  True False False  True False  True False]
np.where(to_reassign) (array([10, 13, 16, 18], dtype=int64),)
new_centers [11 24 33 72]
centers [ 0.51612664  0.48724141  0.50478939  0.46328761  0.41928756  0.50768023
  0.48635517  0.48744328  0.59401064  0.55509388  0.33723042  0.37875769
  0.5366691   0.71604087  0.36911868  0.4626776   0.37506238  0.60670616
  0.21136754  0.54321791]

>>>>>> Sparse Matrix Clustering
a
to_reassign [False False False False False False False False False False  True False
 False  True False False  True False  True False]
np.where(to_reassign) (array([10, 13, 16, 18], dtype=int64),)
new_centers [11 24 33 72]
centers [ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.33723042  0.          0.
  0.71604087  0.          0.          0.37506238  0.          0.21136754
  0.        ]
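To make the divergence easier to see, the two printed `centers` arrays can be compared index by index. This is only a sketch over the values printed above (written for Python 3). It does not touch scikit-learn, and it makes no assumption about which slice of the centers the print statement showed.

```python
# Values copied from the two "centers" printouts above.
dense_centers = [
    0.51612664, 0.48724141, 0.50478939, 0.46328761, 0.41928756,
    0.50768023, 0.48635517, 0.48744328, 0.59401064, 0.55509388,
    0.33723042, 0.37875769, 0.5366691, 0.71604087, 0.36911868,
    0.4626776, 0.37506238, 0.60670616, 0.21136754, 0.54321791,
]
sparse_centers = [
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.33723042, 0.0, 0.0, 0.71604087, 0.0,
    0.0, 0.37506238, 0.0, 0.21136754, 0.0,
]
reassigned = [10, 13, 16, 18]  # from np.where(to_reassign)

# Indices where the dense and sparse runs still agree.
same = [i for i, (d, s) in enumerate(zip(dense_centers, sparse_centers)) if d == s]
# Indices that are zero in the sparse run.
zeroed = [i for i, s in enumerate(sparse_centers) if s == 0.0]

print("agree at:          ", same)
print("zero in sparse run:", zeroed)
```

The two runs agree only at the reassigned indices, and every other value is zero in the sparse run, which suggests the sparse code path also clears the centers that were not supposed to be reassigned.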

My modified version of the script:

import time
import gc
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans as MiniBatchKMeans
#from memory_profiler import profile


#@profile
def bench_dense(a_random_matrix):
    print ">>>>> Dense Matrix Clustering"
    # create a random dense matrix
    dense_matrix = a_random_matrix.copy()
    s = time.time()
    km = MiniBatchKMeans(
        n_clusters=20, init='k-means++', 
        batch_size=100, 
        n_init=10, verbose=0,
        random_state=37,)
    km.fit_predict(dense_matrix)  # cluster the points
    print "Clustered dense matrix in: %.3fs" % (time.time() - s)


#@profile
def bench_sparse(a_random_matrix):
    print ">>>>>> Sparse Matrix Clustering"
    # convert the dense matrix in sparse format
    sparse_matrix = csr_matrix(a_random_matrix.copy())
    assert np.all((sparse_matrix == a_random_matrix).sum())
    s = time.time()
    km = MiniBatchKMeans(
        n_clusters=20, init='k-means++', 
        batch_size=100, 
        n_init=10, verbose=0,
        random_state=37,)
    km.fit_predict(sparse_matrix)
    print "Clustered sparse matrix in: %.3fs" % (time.time() - s)



if __name__ == '__main__':
    a_random_matrix = np.random.random((
        100000,  # 100K 'fake' documents
        500  # 500 dimensions
    ))
    try:
        np.random.seed(42)
        bench_dense(a_random_matrix)
    except AssertionError, e:
        print e
    gc.collect()
    try:
        np.random.seed(42)
        bench_sparse(a_random_matrix)
    except AssertionError, e:
        print e
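For readers on Python 3, the same comparison could be sketched roughly as follows. This is not the script that produced the output above: it assumes a recent scikit-learn, uses a much smaller matrix by default so that it runs quickly, and the helper names `bench` and `compare` are made up for illustration. A recent release may not show the divergence seen with 0.15.x.

```python
import time

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans


def bench(matrix, n_clusters=20, seed=37):
    """Fit MiniBatchKMeans on `matrix`; return (seconds, labels, inertia)."""
    km = MiniBatchKMeans(
        n_clusters=n_clusters, init="k-means++",
        batch_size=100, n_init=10, verbose=0,
        random_state=seed,  # same seed for both runs, as in the script above
    )
    start = time.perf_counter()
    labels = km.fit_predict(matrix)  # cluster the points
    return time.perf_counter() - start, labels, km.inertia_


def compare(n_samples=2000, n_features=50, n_clusters=20):
    """Cluster the same random matrix in dense and in CSR form."""
    rng = np.random.RandomState(42)
    dense = rng.random_sample((n_samples, n_features))
    sparse = csr_matrix(dense)
    # Both representations must hold exactly the same values.
    assert np.array_equal(sparse.toarray(), dense)
    for name, matrix in (("dense", dense), ("sparse", sparse)):
        seconds, labels, inertia = bench(matrix, n_clusters)
        print("%s: clustered in %.3fs, inertia %.3f" % (name, seconds, inertia))


if __name__ == "__main__":
    compare()
```

Passing the same array to both runs and fixing `random_state` removes the two sources of randomness discussed in the comments, so any remaining difference comes from the dense and sparse code paths themselves.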

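I would first re-run this benchmark using the same random matrix. If not by passing the same matrix, then by setting the seed at the top of each function.

Good point, done. I am pretty sure the result will be the same, though: the differences I get are too large to be explained by the randomness of the matrix generation alone (remember, there are 500 * 100K, i.e. 50M random numbers generated at each step...). I did not want to use the exact same matrix because I wanted to free the memory used by the matrix between the two steps.

So, if I use regular (not mini-batch) KMeans, the number of iterations to converge and the performance are the same, with a run time of 812 s for dense and 3654 s for sparse, which makes sense for this matrix (your sparse matrix is not sparse, and the overhead of the sparse encoding greatly increases the run time).

That's right, I expected a non-sparse matrix in a sparse representation to take more time, but still with the same number of iterations. Thanks for your answer below :)

I am using Win 7 64-bit, Anaconda 2.1.0 64-bit, Python 2.7, sklearn 0.15.2, numpy 1.9.0. I opened an issue.

I am running Mac OS X 10.9.5 64-bit, Python 2.7, scikit-learn 0.15.0, numpy 1.8.0, scipy 0.14.0. I will try to investigate this further when I have a bit of time.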
