Apache Spark: bisecting K-means dendrogram from the linkage matrix
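I am trying to generate a dendrogram from bisecting K-means clustering results in Spark. I have found a few variations of this question online, and there is a JIRA request for it, but I have not found any other workable solution.

To try to get this working, I compiled Spark MLlib 2.2.0 with yu iksw's contribution for Spark and made some changes to the log output so that it reports more about the bisecting cluster selection process. For testing purposes I uploaded this jar together with a sample SBT build, so that anyone who wants to help but does not want to rebuild Spark MLlib from source can run their own tests. As you can see in my sbt build, the mllib and mllib-local JARs are in the /lib folder.

To plot my test linkage matrix output I use a Jupyter notebook and manually pass the Spark linkage output to the scipy dendrogram. The Jupyter notebook is uploaded as well.

In short, my test output on the Iris dataset appears to work with 3-4 clusters, but with 5 or more clusters the linkage matrix fails to produce valid cluster indices. I have tried several ways of fixing this, such as changing the toLinkageMatrix selection process and the toArray function it calls, with no success.

I have a good conceptual understanding of bisecting K-means clustering, but I am struggling to find exactly where and why the linkage matrix fails in Spark. My Spark notebook HTMLs show my complete Spark code. The full Spark 2.2.0 source code used for the build is also available.

The main source files I changed are: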
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
and
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
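Below is my clustering output with 3 clusters and with 10 clusters.
Note: I also rewrote the scipy dendrogram test function to show exactly where and why the Z linkage fails when plotting.
Iris data, 3 clusters output (the 10 clusters output follows it)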
COST
84.20375254574043
CLUSTER CENTERS
Sepal_Length Sepal_Width Petal_Length Petal_Width
5.01 3.37 1.56 0.29
5.95 2.77 4.45 1.45
6.85 3.07 5.74 2.07
ADJACENCY MATRIX
fromNodeID toNodeID distance
0 1 2.540618378626947
0 2 2.540618378626947
2 3 1.044390196577994
2 4 1.044390196577994
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 1 clusters on level 2.
The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97
LINKAGE MATRIX
node1 node2 distance tree_size
1 2 1.044390196577994 2
0 3 2.540618378626947 3
SCIPY DENDROGRAM
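For reference, here is a minimal sketch of how the 3-cluster linkage matrix above can be handed to scipy by hand. It assumes numpy and scipy are installed; the variable names (Z3, layout, leaf_order) are only illustrative and do not come from the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, is_valid_linkage

# Linkage matrix copied from the 3-cluster LINKAGE MATRIX output above.
# Columns: node1, node2, distance, tree_size.
Z3 = np.array([
    [1.0, 2.0, 1.044390196577994, 2.0],
    [0.0, 3.0, 2.540618378626947, 3.0],
])

# scipy expects (n - 1) rows for n leaves, so 2 rows describe 3 leaves.
is_valid = is_valid_linkage(Z3)

# no_plot=True only computes the layout, so matplotlib is not needed.
# Drop that argument and call matplotlib.pyplot.show() to draw the figure.
layout = dendrogram(Z3, no_plot=True)
leaf_order = layout["leaves"]

print(is_valid, leaf_order)
```

With 3 clusters the matrix passes scipy's validity check, which matches the observation that small cluster counts plot without problems.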
Iris data, 10 clusters output
COST
27.981293071222368
CLUSTER CENTERS (rounded)
Sepal_Length Sepal_Width Petal_Length Petal_Width
4.68 3.08 1.45 0.2
5 2.4 3.2 1.03
5.07 3.46 1.44 0.28
5.4 3.89 1.51 0.27
5.6 2.66 4.05 1.25
6.01 2.71 4.95 1.79
6.4 2.97 4.55 1.41
6.49 2.9 5.37 1.8
6.61 3.16 5.57 2.29
7.48 3.13 6.3 2.05
ADJACENCY MATRIX
fromNodeID toNodeID distance
0 4 1.7986418455383477
0 5 1.7986418455383477
1 6 0.32116390552184665
1 7 0.32116390552184665
2 0 0.48992124227033834
2 1 0.48992124227033834
3 2 2.540618378626947
3 10 2.540618378626947
8 12 0.36029111643410283
8 13 0.36029111643410283
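A quick sanity check on the two adjacency matrices, as a rough sketch. It assumes each row is a parent-to-child edge of the bisecting tree and that the listing is meant to be complete; under that assumption a binary tree with n leaf clusters has 2 * (n - 1) edges and exactly one root. The helper name tree_summary is hypothetical.

```python
def tree_summary(edges, n_leaves):
    """Summarise a list of (parent, child, distance) edges."""
    parents = {p for p, _, _ in edges}
    children = {c for _, c, _ in edges}
    return {
        "edges": len(edges),
        # A full binary tree with n leaves has n - 1 internal nodes,
        # each contributing two edges.
        "expected_edges": 2 * (n_leaves - 1),
        # A root is a parent that never appears as a child.
        "roots": sorted(parents - children),
    }

# Adjacency matrix from the 3-cluster output.
edges_3 = [
    (0, 1, 2.540618378626947), (0, 2, 2.540618378626947),
    (2, 3, 1.044390196577994), (2, 4, 1.044390196577994),
]

# Adjacency matrix from the 10-cluster output.
edges_10 = [
    (0, 4, 1.7986418455383477), (0, 5, 1.7986418455383477),
    (1, 6, 0.32116390552184665), (1, 7, 0.32116390552184665),
    (2, 0, 0.48992124227033834), (2, 1, 0.48992124227033834),
    (3, 2, 2.540618378626947), (3, 10, 2.540618378626947),
    (8, 12, 0.36029111643410283), (8, 13, 0.36029111643410283),
]

print(tree_summary(edges_3, 3))
print(tree_summary(edges_10, 10))
```

The 3-cluster listing has the expected 4 edges and a single root (node 0). The 10-cluster listing shown here has 10 edges where 18 would be expected, and two nodes (3 and 8) that never appear as a child, so either the printed table is incomplete or the node IDs do not form a single tree.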
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 2 clusters on level 2.
Dividing 4 clusters on level 3.
Dividing 2 clusters on level 4.
The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97
The divisible clusters needed for this iteration were : d = 4, cost =13.72863636363627, size = 22
The divisible clusters needed for this iteration were : d = 13, cost =10.73588235294028, size = 34
LINKAGE MATRIX
node1 node2 distance tree_size
2 3 0.32116390552184665 2
7 8 0.34473006617374347 2
5 6 0.36029111643410283 2
17 10 0.48992124227033834 4
4 12 0.5802085628837165 3
11 9 0.839611851358609 3
14 15 1.044390196577994 6
0 1 1.7986418455383477 2
13 16 2.540618378626947 10
SCIPY DENDROGRAM FAILURE TEST FUNCTION OUTPUTS
[2. 3. 0.32116391 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
2.0 >= 10+0
= False
Checking .... if indice B >= # of clusters + iteration we are on
3.0 >= 10+0
= False
-------------------------------------------------
[7. 8. 0.34473007 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
7.0 >= 10+1
= False
Checking .... if indice B >= # of clusters + iteration we are on
8.0 >= 10+1
= False
-------------------------------------------------
[5. 6. 0.36029112 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
5.0 >= 10+2
= False
Checking .... if indice B >= # of clusters + iteration we are on
6.0 >= 10+2
= False
-------------------------------------------------
[17. 10. 0.48992124 4. ]
Checking.... if indice A >= # of clusters + iteration we are on
17.0 >= 10+3
= True
Checking .... if indice B >= # of clusters + iteration we are on
10.0 >= 10+3
= False
-------------------------------------------------
[ 4. 12. 0.58020856 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
4.0 >= 10+4
= False
Checking .... if indice B >= # of clusters + iteration we are on
12.0 >= 10+4
= False
-------------------------------------------------
[11. 9. 0.83961185 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
11.0 >= 10+5
= False
Checking .... if indice B >= # of clusters + iteration we are on
9.0 >= 10+5
= False
-------------------------------------------------
[14. 15. 1.0443902 6. ]
Checking.... if indice A >= # of clusters + iteration we are on
14.0 >= 10+6
= False
Checking .... if indice B >= # of clusters + iteration we are on
15.0 >= 10+6
= False
-------------------------------------------------
[0. 1. 1.79864185 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
0.0 >= 10+7
= False
Checking .... if indice B >= # of clusters + iteration we are on
1.0 >= 10+7
= False
-------------------------------------------------
[13. 16. 2.54061838 10. ]
Checking.... if indice A >= # of clusters + iteration we are on
13.0 >= 10+8
= False
Checking .... if indice B >= # of clusters + iteration we are on
16.0 >= 10+8
= False
-------------------------------------------------
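The check behind the output above can be sketched in a few lines of plain Python. This is not the exact function from the notebook, only a reconstruction of the rule it prints: in row i of a linkage matrix with n leaves, an index is usable only if it is below n + i, because index n + i is the cluster that row i itself creates. The name find_invalid_rows is hypothetical.

```python
def find_invalid_rows(linkage, n_leaves):
    """Return (row, column, index) for every index used before it exists."""
    problems = []
    for i, row in enumerate(linkage):
        for col in (0, 1):
            idx = int(row[col])
            # Before row i is merged, only indices 0 .. n_leaves + i - 1 exist.
            if idx >= n_leaves + i:
                problems.append((i, col, idx))
    return problems

# Linkage matrix from the 3-cluster output.
Z3 = [
    [1, 2, 1.044390196577994, 2],
    [0, 3, 2.540618378626947, 3],
]

# Linkage matrix from the 10-cluster output.
Z10 = [
    [2, 3, 0.32116390552184665, 2],
    [7, 8, 0.34473006617374347, 2],
    [5, 6, 0.36029111643410283, 2],
    [17, 10, 0.48992124227033834, 4],
    [4, 12, 0.5802085628837165, 3],
    [11, 9, 0.839611851358609, 3],
    [14, 15, 1.044390196577994, 6],
    [0, 1, 1.7986418455383477, 2],
    [13, 16, 2.540618378626947, 10],
]

print(find_invalid_rows(Z3, 3))    # []
print(find_invalid_rows(Z10, 10))  # [(3, 0, 17)]
```

Only row 3 of the 10-cluster matrix is flagged. Under scipy's numbering, index 17 is the cluster created by row 7 (10 + 7), the merge of nodes 0 and 1 at distance 1.7986, yet it is referenced by row 3 at distance 0.4899. So the rows are sorted by distance, but a parent merge ends up listed before the child merge it depends on.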