PySpark k-means with categorical variables


I am starting out with k-means clustering in pyspark (v1.6.2), using the following example, which includes mixed variable types:

# Import libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.mllib.clustering import KMeansModel

# Create sample DF
sample = sqlContext.createDataFrame(
    [["a@email.com", 12000, "M"],
     ["b@email.com", 43000, "M"],
     ["c@email.com",  5000, "F"],
     ["d@email.com", 60000, "M"]],
    ["email", "income", "gender"])
I used StringIndexer, OneHotEncoder, and VectorAssembler to handle the categorical attributes, as below:

# Indexers encode string columns with numeric indices
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in sample.columns if x != 'income']  # note: `x not in 'income'` would be a substring test

# One-hot encode the indexed categories
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in sample.columns if x != 'income']

# Assemble the encoded columns plus income into a single feature vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in sample.columns if x != 'income'] + ['income'],
    outputCol="features")
This piece makes sure the transformations run smoothly:

pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(sample)
indexed = model.transform(sample)

indexed.show()
I understand I can run k-means on this transformed DataFrame like so:

kmeans = (KMeans()
          .setK(2)
          .setFeaturesCol("features")
          .setPredictionCol("prediction"))

kmeans_transformer = kmeans.fit(indexed)
oo = kmeans_transformer.transform(indexed)

oo.select('email', 'income', 'gender', 'features', 'prediction').show(truncate=False)

+-----------+------+------+-------------------------+----------+
|email      |income|gender|features                 |prediction|
+-----------+------+------+-------------------------+----------+
|a@email.com|12000 |M     |[0.0,1.0,0.0,1.0,12000.0]|1         |
|b@email.com|43000 |M     |(5,[3,4],[1.0,43000.0])  |0         |
|c@email.com|5000  |F     |(5,[0,4],[1.0,5000.0])   |1         |
|d@email.com|60000 |M     |[0.0,0.0,1.0,1.0,60000.0]|0         |
+-----------+------+------+-------------------------+----------+
But I would like to see:

1) How to do the same thing with pyspark.mllib.clustering.KMeansModel, so that I can determine the optimal (lowest-cost) value of K (using the KMeans.train and computeCost functions); see the first sketch after this list.

2) How to get the cluster centers back on the original scale (i.e., the "M"/"F" gender labels rather than the encoded 0/1 scale); see the second sketch below.
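For 1), a minimal sketch of the usual elbow-style search with the RDD-based API, assuming the indexed DataFrame produced by the pipeline above (in Spark 1.6 the features column already holds pyspark.mllib.linalg vectors, so it can be passed to KMeans.train directly):

from pyspark.mllib.clustering import KMeans as MLlibKMeans

# Extract the assembled feature vectors as an RDD of mllib vectors
feature_rdd = indexed.select("features").rdd.map(lambda row: row.features).cache()

# Train one model per candidate K and record the within-cluster cost (WSSSE)
costs = {}
for k in range(2, 5):  # the toy sample only supports small K
    model = MLlibKMeans.train(feature_rdd, k, maxIterations=20, seed=1)
    costs[k] = model.computeCost(feature_rdd)

for k in sorted(costs):
    print(k, costs[k])

Note that the cost decreases monotonically as K grows, so in practice one looks for the "elbow" where the improvement flattens out rather than the literal minimum.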


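For 2), clusterCenters() on the fitted ml model returns centers in the encoded feature space. One simple way to read them on the original scale is to aggregate the raw columns per predicted cluster; a sketch against the oo DataFrame above:

# Centers in the encoded feature space (one array per cluster)
centers = kmeans_transformer.clusterCenters()

# Numeric columns: the per-cluster mean is the center on the original scale
oo.groupBy("prediction").avg("income").show()

# Categorical columns: report the most frequent (modal) label per cluster instead
oo.groupBy("prediction", "gender").count().orderBy("prediction").show()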

The only attribute that effectively matters to your clustering is income: it ranges in the tens of thousands while the one-hot columns are 0 or 1, so income dominates the Euclidean distance and everything else is negligible. Make sure you standardize the inputs, since k-means is a distance-based algorithm, e.g. with a StandardScaler (a sketch follows).
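A sketch of how that advice could be wired into the existing pipeline; scaled_features is an assumed column name, and withMean=False keeps the sparse one-hot vectors sparse:

from pyspark.ml.feature import StandardScaler

# Scale each feature dimension to unit standard deviation so that income
# no longer dominates the distance computation
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withStd=True, withMean=False)

pipeline = Pipeline(stages=string_indexers + encoders + [assembler, scaler])
scaled = pipeline.fit(sample).transform(sample)

kmeans = (KMeans()
          .setK(2)
          .setFeaturesCol("scaled_features")
          .setPredictionCol("prediction"))
clusters = kmeans.fit(scaled).transform(scaled)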