Apache Spark: bisecting KMeans dendrogram from a linkage matrix

apache-spark

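I'm trying to generate a dendrogram of bisecting KMeans clustering results in Spark. I found a few variations of this question online, including a JIRA request, but I have not found a workable solution.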

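To attempt this, I compiled Spark MLlib 2.2.0 with yu iksw's code for Spark and made a few changes to the log output so that it reports more about how the bisecting clustering chooses which clusters to divide. For testing purposes, I uploaded this JAR and a sample SBT build, so that anyone who wants to help without rebuilding Spark MLlib from source can run their own tests. As my sbt build shows, the mllib and mllib-local JARs are in the /lib folder.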

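To plot my test linkage matrix output, I use a Jupyter notebook and manually pass the Spark linkage output to the scipy dendrogram function. The Jupyter notebook is uploaded as well.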
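
The manual hand-off from Spark output to scipy can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `parse_linkage` and `plot_dendrogram` are hypothetical helper names, and the rows are the 3-cluster linkage rows printed later in this post. scipy and numpy are imported lazily so that the parsing step runs without them.

```python
def parse_linkage(text):
    """Parse whitespace-separated 'node1 node2 distance tree_size' rows."""
    rows = []
    for line in text.strip().splitlines():
        node1, node2, distance, size = line.split()
        rows.append([float(node1), float(node2), float(distance), float(size)])
    return rows


def plot_dendrogram(rows):
    """Hand the rows to scipy (hypothetical helper, needs numpy and scipy)."""
    import numpy as np
    from scipy.cluster.hierarchy import dendrogram

    # no_plot=True returns the layout dictionary without drawing anything.
    return dendrogram(np.array(rows), no_plot=True)


# Linkage rows of the 3-cluster Iris run, copied from the output in this post.
THREE_CLUSTER_LINKAGE = """
1   2   1.044390196577994   2
0   3   2.540618378626947   3
"""

rows = parse_linkage(THREE_CLUSTER_LINKAGE)
print(rows)
```

Calling `plot_dendrogram(rows)` then passes the parsed matrix to scipy, which validates it before computing the dendrogram layout.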

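In short, my test output on the Iris dataset appears to work when I use 3-4 clusters, but when I try 5 or more clusters, the linkage matrix does not produce valid cluster indices. I tried several ways of fixing this, such as changing the toLinkageMatrix selection process and the toArray function it calls, but none of them worked.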
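
One way to narrow this down is to check whether the adjacency rows form a single complete binary tree before any linkage conversion happens. The sketch below is an assumption-laden illustration: `summarize_tree` is a hypothetical helper, and the edges are the 10-cluster adjacency rows printed later in this post.

```python
from collections import defaultdict


def summarize_tree(edges):
    """Summarize (parent, child, distance) rows as a tree structure."""
    children = defaultdict(list)
    for parent, child, _distance in edges:
        children[parent].append(child)
    parents = set(children)
    all_children = {c for kids in children.values() for c in kids}
    return {
        # Nodes that are never a child of another node.
        "roots": sorted(parents - all_children),
        # Nodes that never appear as a parent.
        "leaves": sorted(all_children - parents),
        # Parents that do not have exactly two children.
        "non_binary": sorted(p for p, kids in children.items() if len(kids) != 2),
    }


# Adjacency rows of the 10-cluster Iris run, copied from this post.
TEN_CLUSTER_EDGES = [
    (0, 4, 1.7986418455383477), (0, 5, 1.7986418455383477),
    (1, 6, 0.32116390552184665), (1, 7, 0.32116390552184665),
    (2, 0, 0.48992124227033834), (2, 1, 0.48992124227033834),
    (3, 2, 2.540618378626947), (3, 10, 2.540618378626947),
    (8, 12, 0.36029111643410283), (8, 13, 0.36029111643410283),
]

summary = summarize_tree(TEN_CLUSTER_EDGES)
print(summary)
```

For these rows the summary reports two roots (3 and 8) and seven leaves, whereas a complete bisecting tree for 10 clusters would have one root, 10 leaves and 18 edges. The printed adjacency rows therefore look incomplete; whether that comes from the tree itself or from how it is flattened is not established here.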

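I have a good conceptual understanding of bisecting K-Means clustering, but I'm having a hard time finding exactly where and why the linkage matrix fails in Spark. If you look at my Spark notebook HTMLs, you can see my complete Spark code.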
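
For reference, the mapping from a bisecting tree to a scipy-style linkage matrix can be sketched in plain Python. This is not Spark's toLinkageMatrix; `edges_to_linkage` is a hypothetical function written under two assumptions: leaves are renumbered 0..k-1 in order of node id, and a merge is emitted only once both of its children already have an id. The input is the 3-cluster adjacency matrix printed later in this post.

```python
def edges_to_linkage(edges):
    """Convert (parent, child, distance) rows of a complete binary tree
    into linkage rows [id_a, id_b, distance, leaf_count]."""
    children, height = {}, {}
    for parent, child, distance in edges:
        children.setdefault(parent, []).append(child)
        height[parent] = distance
    all_children = {c for kids in children.values() for c in kids}
    leaves = sorted(all_children - set(children))

    # Leaves take ids 0..k-1; merged nodes take k, k+1, ... in row order.
    new_id = {leaf: i for i, leaf in enumerate(leaves)}
    size = {leaf: 1 for leaf in leaves}
    pending = set(children)
    linkage = []
    while pending:
        # Only nodes whose children already have ids may be emitted.
        ready = [p for p in pending if all(c in new_id for c in children[p])]
        if not ready:
            raise ValueError("edges contain a cycle or a missing child")
        node = min(ready, key=lambda p: height[p])
        if len(children[node]) != 2:
            raise ValueError(f"node {node} does not have exactly two children")
        a, b = sorted(new_id[c] for c in children[node])
        size[node] = sum(size[c] for c in children[node])
        new_id[node] = len(leaves) + len(linkage)
        linkage.append([a, b, height[node], size[node]])
        pending.remove(node)
    if len(linkage) != len(leaves) - 1:
        raise ValueError("edges do not form a single complete tree")
    return linkage


# Adjacency rows of the 3-cluster Iris run, copied from this post.
THREE_CLUSTER_EDGES = [
    (0, 1, 2.540618378626947), (0, 2, 2.540618378626947),
    (2, 3, 1.044390196577994), (2, 4, 1.044390196577994),
]

linkage = edges_to_linkage(THREE_CLUSTER_EDGES)
print(linkage)
```

On the 3-cluster adjacency rows this reproduces the two linkage rows reported for that run. Because a row is emitted only after its children exist, it cannot reference an id before it is created, even when a child split has a larger distance than its parent split.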

The full Spark 2.2.0 source code used for the build is also available.

The main source files I changed are

mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala

and

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

Below are my clustering outputs using 3 and 10 clusters.

Note: I also rewrote the scipy dendrogram test function to show exactly where and why the Z linkage fails when plotting.

Iris data, 3 clusters output

    COST
    84.20375254574043

    CLUSTER CENTERS
    Sepal_Length    Sepal_Width Petal_Length    Petal_Width
    5.01    3.37    1.56    0.29
    5.95    2.77    4.45    1.45
    6.85    3.07    5.74    2.07

    ADJACENCY MATRIX
    FromNodeID  toNodeID    distance
    0   1   2.540618378626947
    0   2   2.540618378626947
    2   3   1.044390196577994
    2   4   1.044390196577994

    LOG OUTPUTS
    Feature dimension: 4.
    Number of points: 150.
    Initial cost: 681.3705999999911.
    The minimum number of points of a divisible cluster is 1.
    Dividing 1 clusters on level 1.
    Dividing 1 clusters on level 2.
    The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
    The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97

    LINKAGE MATRIX
    node1   node2   distance    tree_size
    1   2   1.044390196577994   2
    0   3   2.540618378626947   3

    SCIPY DENDOGRAM

Iris data, 10 clusters output

    COST
    27.981293071222368

    CLUSTER CENTERS (rounded)
    Sepal_Length    Sepal_Width Petal_Length    Petal_Width
    4.68    3.08    1.45     0.2
       5     2.4     3.2    1.03
    5.07    3.46    1.44    0.28
     5.4    3.89    1.51    0.27
     5.6    2.66    4.05    1.25
    6.01    2.71    4.95    1.79
     6.4    2.97    4.55    1.41
    6.49     2.9    5.37    1.8
    6.61    3.16    5.57    2.29
    7.48    3.13     6.3    2.05

    ADJACENCY MATRIX 
    fromNodeID  toNodeID    distance
    0   4   1.7986418455383477
    0   5   1.7986418455383477
    1   6   0.32116390552184665
    1   7   0.32116390552184665
    2   0   0.48992124227033834
    2   1   0.48992124227033834
    3   2   2.540618378626947
    3   10  2.540618378626947
    8   12  0.36029111643410283
    8   13  0.36029111643410283

    LOG OUTPUTS
    Feature dimension: 4.
    Number of points: 150.
    Initial cost: 681.3705999999911.
    The minimum number of points of a divisible cluster is 1.
    Dividing 1 clusters on level 1.
    Dividing 2 clusters on level 2.
    Dividing 4 clusters on level 3.
    Dividing 2 clusters on level 4.
    d =The divisible clusters needed for this iteration were : 
    d = 1, cost =681.3705999999911, size = 150
    The divisible clusters needed for this iteration were : 
    d = 3, cost =123.79587628866193, size = 97
    The divisible clusters needed for this iteration were : 
    d = 4, cost =13.72863636363627, size = 22
    The divisible clusters needed for this iteration were : 
    d = 13, cost =10.73588235294028, size = 34

    LINKAGE MATRIX
    node1   node2   distance    tree_size
    2   3   0.32116390552184665 2
    7   8   0.34473006617374347 2
    5   6   0.36029111643410283 2
    17  10  0.48992124227033834 4  
    4   12  0.5802085628837165  3
    11  9   0.839611851358609   3
    14  15  1.044390196577994   6
     0  1   1.7986418455383477  2
    13  16  2.540618378626947   10

    SCIPY DENDOGRAM FAILURE TEST FUNCTION OUTPUTS

    [2.         3.         0.32116391 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              2.0 >= 10+0
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              3.0 >= 10+0
    = False
    -------------------------------------------------

    [7.         8.         0.34473007 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              7.0 >= 10+1
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              8.0 >= 10+1
    = False
    -------------------------------------------------

    [5.         6.         0.36029112 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              5.0 >= 10+2
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              6.0 >= 10+2
    = False
    -------------------------------------------------

    [17.         10.          0.48992124  4.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              17.0 >= 10+3
    = True
    Checking .... if indice B >= # of clusters + iteration we are on
              10.0 >= 10+3
    = False
    -------------------------------------------------

    [ 4.         12.          0.58020856  3.        ]

    Checking.... if indice A >= # of clusters + iteration we are on
              4.0 >= 10+4
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              12.0 >= 10+4
    = False
    -------------------------------------------------

    [11.          9.          0.83961185  3.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              11.0 >= 10+5
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              9.0 >= 10+5
    = False
    -------------------------------------------------

    [14.        15.         1.0443902  6.       ]
    Checking.... if indice A >= # of clusters + iteration we are on
              14.0 >= 10+6
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              15.0 >= 10+6
    = False
    -------------------------------------------------

    [0.         1.         1.79864185 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              0.0 >= 10+7
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              1.0 >= 10+7
    = False
    -------------------------------------------------

    [13.         16.          2.54061838 10.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              13.0 >= 10+8
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              16.0 >= 10+8
    = False
    -------------------------------------------------
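
The check logged above compares each index in row i against the number of clusters plus i, since a merged cluster created by row i gets the id n + i and cannot be referenced earlier. A minimal sketch of that check, with `find_forward_references` as a hypothetical name and the 10-cluster linkage rows copied from this post:

```python
def find_forward_references(linkage, n_leaves):
    """Return (row, index) pairs where a row uses an id not yet created."""
    failures = []
    for i, (a, b, _distance, _size) in enumerate(linkage):
        for idx in (a, b):
            # Row i may only reference leaves and merges from rows 0..i-1.
            if idx >= n_leaves + i:
                failures.append((i, idx))
    return failures


# Linkage rows of the 10-cluster Iris run, copied from this post.
TEN_CLUSTER_LINKAGE = [
    (2, 3, 0.32116390552184665, 2),
    (7, 8, 0.34473006617374347, 2),
    (5, 6, 0.36029111643410283, 2),
    (17, 10, 0.48992124227033834, 4),
    (4, 12, 0.5802085628837165, 3),
    (11, 9, 0.839611851358609, 3),
    (14, 15, 1.044390196577994, 6),
    (0, 1, 1.7986418455383477, 2),
    (13, 16, 2.540618378626947, 10),
]

failures = find_forward_references(TEN_CLUSTER_LINKAGE, n_leaves=10)
print(failures)  # [(3, 17)]
```

Only row 3 fails: it merges ids 17 and 10, and id 17 is the cluster formed by row 7 (0, 1), whose distance 1.7986 is larger than row 3's 0.4899. One possible reading, not confirmed against the Spark source, is that ordering rows by distance places a parent merge before its child whenever the child split is wider than the parent split.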