How to revert a one-hot encoding in Spark (Scala)


After running k-means (mllib, Spark, Scala), I want to make sense of the cluster centers I got from data that was preprocessed with mllib's OneHotEncoder.

One center looks like this:

Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0, ... ,0.0,17.771620072281845,0.0,0.0,0.0,0.0]

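This is obviously not very human-friendly. Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
What if I look for the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?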

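For cluster centroids, reverting the encoding is not possible (and strongly discouraged). Suppose your original feature has the value "3" out of 6 possible categories and is encoded as
[0.0,0.0,1.0,0.0,0.0,0.0]
. In this case it is easy to extract 3 as the correct category from the encoding.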

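But after applying k-means, you may get a cluster centroid that looks like this for the same feature:
[0.0,0.13,0.0,0.77,0.1,0.0]
. If you decode this back to the previous representation, for example as "4" out of 6 because category 4 has the largest value, you lose information and the model may be distorted.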
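The difference between the two vectors above can be sketched in plain Scala, without Spark. The object and function names (`OneHotDecodeSketch`, `argmax`, `isOneHot`) are hypothetical and chosen only for illustration; categories are numbered from 1 to match the example in the answer.

```scala
object OneHotDecodeSketch {
  // Index (0-based) of the largest entry. For a true one-hot vector
  // this is exactly the position of the single 1.0.
  def argmax(v: Array[Double]): Int = v.indices.maxBy(i => v(i))

  // A vector is a valid one-hot encoding only if it contains exactly
  // one 1.0 and zeros everywhere else.
  def isOneHot(v: Array[Double]): Boolean =
    v.count(_ == 1.0) == 1 && v.forall(x => x == 0.0 || x == 1.0)

  def main(args: Array[String]): Unit = {
    val encoded  = Array(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)   // a single data point
    val centroid = Array(0.0, 0.13, 0.0, 0.77, 0.1, 0.0) // a cluster centroid

    // The data point decodes cleanly to category 3.
    println(s"encoded:  oneHot=${isOneHot(encoded)}, category=${argmax(encoded) + 1}")

    // The centroid is not one-hot; taking the argmax yields category 4
    // but silently discards the 0.13 and 0.1 weights.
    println(s"centroid: oneHot=${isOneHot(centroid)}, argmax=${argmax(centroid) + 1}")
  }
}
```

Since a k-means centroid is the mean of the points assigned to it, the entries of such a block can arguably be read as the share of cluster members in each category (here roughly 13%, 77% and 10%), which is often more informative than a single decoded value.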

Edit: adding to the answer a possible way, taken from the comments, to recover the original values of individual data points


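If your data points carry IDs, you can assign the data points to clusters and then perform a select/join on the ID against the dataset as it was before encoding to get back the original values.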
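A minimal sketch of that idea, using plain Scala collections instead of Spark so that it runs on its own. Everything here is an assumption made for illustration: the record types `Original` and `Encoded`, the column names `color` and `size`, and the helper `nearestOriginal`. With Spark, the `Map` lookup would correspond to a join on the ID column between the encoded and the original dataset.

```scala
object NearestPointLookupSketch {
  // Hypothetical record types: the raw row and its encoded counterpart share an id.
  final case class Original(id: Long, color: String, size: Double)
  final case class Encoded(id: Long, features: Array[Double])

  // Squared Euclidean distance between two vectors of equal length.
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Find the encoded point closest to the centroid, then fetch the
  // original row by id. No decoding of the one-hot vector is needed.
  def nearestOriginal(
      centroid: Array[Double],
      encoded: Seq[Encoded],
      originals: Map[Long, Original]): Option[Original] =
    if (encoded.isEmpty) None
    else originals.get(encoded.minBy(e => sqDist(e.features, centroid)).id)

  def main(args: Array[String]): Unit = {
    val originals = Seq(
      Original(1L, "red", 2.0),
      Original(2L, "green", 5.0),
      Original(3L, "red", 2.5)
    ).map(o => o.id -> o).toMap

    // Assumed feature layout: [size, isRed, isGreen]
    val encoded = Seq(
      Encoded(1L, Array(2.0, 1.0, 0.0)),
      Encoded(2L, Array(5.0, 0.0, 1.0)),
      Encoded(3L, Array(2.5, 1.0, 0.0))
    )

    val centroid = Array(2.4, 1.0, 0.0)
    println(nearestOriginal(centroid, encoded, originals))
  }
}
```

The nearest point is only a representative of the cluster, not the centroid itself, so it is best treated as a readable example of what the cluster contains.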

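Thanks! I understand your answer. What if I look for the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?

@JoãoMoura Then I think the easiest way is to have an ID on every data point and, after a point has been assigned to its cluster, retrieve the original values by ID. That way you do not need to revert the encoding at all; you just perform a simple select/join between the original data points and the encoded dataset.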