PySpark k-means with categorical variables


I am starting out with k-means clustering in pyspark (v1.6.2), using the following example, which includes mixed variable types:

# Import libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.mllib.clustering import KMeansModel

# Create sample DF
sample = sqlContext.createDataFrame(
    [["a@email.com", 12000, "M"],
     ["b@email.com", 43000, "M"],
     ["c@email.com",  5000, "F"],
     ["d@email.com", 60000, "M"]],
    ["email", "income", "gender"])
I used StringIndexer, OneHotEncoder, and VectorAssembler to handle the categorical attributes, as below:

# Indexers encode string columns with numeric indices
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in sample.columns if x != 'income']  # note: `x not in 'income'` would be a substring test

# One-hot encode the indexed categories
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in sample.columns if x != 'income']

# Assemble the encoded columns plus income into a single feature vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in sample.columns if x != 'income'] + ['income'],
    outputCol="features")
This piece makes sure the transformations run smoothly:

pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(sample)
indexed = model.transform(sample)

indexed.show()
I understand I can run k-means on this transformed DataFrame like so:

kmeans = (KMeans()
          .setK(2)
          .setFeaturesCol("features")
          .setPredictionCol("prediction"))

kmeans_transformer = kmeans.fit(indexed)
oo = kmeans_transformer.transform(indexed)

oo.select('email', 'income', 'gender', 'features', 'prediction').show(truncate=False)

+-----------+------+------+-------------------------+----------+
|email      |income|gender|features                 |prediction|
+-----------+------+------+-------------------------+----------+
|a@email.com|12000 |M     |[0.0,1.0,0.0,1.0,12000.0]|1         |
|b@email.com|43000 |M     |(5,[3,4],[1.0,43000.0])  |0         |
|c@email.com|5000  |F     |(5,[0,4],[1.0,5000.0])   |1         |
|d@email.com|60000 |M     |[0.0,0.0,1.0,1.0,60000.0]|0         |
+-----------+------+------+-------------------------+----------+
But I would like to see:

1) How to do the same thing with pyspark.mllib.clustering.KMeansModel, so that I can determine the optimal (lowest-cost) value of K (using the KMeans.train and computeCost functions); see the first sketch after this list.

2) How to get the cluster centers back on the original scale (i.e., the "M"/"F" gender labels rather than the encoded 0/1 scale); see the second sketch below.
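For 1), a minimal sketch of the usual elbow-style search with the RDD-based API, assuming the indexed DataFrame produced by the pipeline above (in Spark 1.6 the features column already holds pyspark.mllib.linalg vectors, so it can be passed to KMeans.train directly):

from pyspark.mllib.clustering import KMeans as MLlibKMeans

# Extract the assembled feature vectors as an RDD of mllib vectors
feature_rdd = indexed.select("features").rdd.map(lambda row: row.features).cache()

# Train one model per candidate K and record the within-cluster cost (WSSSE)
costs = {}
for k in range(2, 5):  # the toy sample only supports small K
    model = MLlibKMeans.train(feature_rdd, k, maxIterations=20, seed=1)
    costs[k] = model.computeCost(feature_rdd)

for k in sorted(costs):
    print(k, costs[k])

Note that the cost decreases monotonically as K grows, so in practice one looks for the "elbow" where the improvement flattens out rather than the literal minimum.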


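For 2), clusterCenters() on the fitted ml model returns centers in the encoded feature space. One simple way to read them on the original scale is to aggregate the raw columns per predicted cluster; a sketch against the oo DataFrame above:

# Centers in the encoded feature space (one array per cluster)
centers = kmeans_transformer.clusterCenters()

# Numeric columns: the per-cluster mean is the center on the original scale
oo.groupBy("prediction").avg("income").show()

# Categorical columns: report the most frequent (modal) label per cluster instead
oo.groupBy("prediction", "gender").count().orderBy("prediction").show()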

The only attribute that effectively matters to your clustering is income: it ranges in the tens of thousands while the one-hot columns are 0 or 1, so income dominates the Euclidean distance and everything else is negligible. Make sure you standardize the inputs, since k-means is a distance-based algorithm, e.g. with a StandardScaler (a sketch follows).
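A sketch of how that advice could be wired into the existing pipeline; scaled_features is an assumed column name, and withMean=False keeps the sparse one-hot vectors sparse:

from pyspark.ml.feature import StandardScaler

# Scale each feature dimension to unit standard deviation so that income
# no longer dominates the distance computation
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withStd=True, withMean=False)

pipeline = Pipeline(stages=string_indexers + encoders + [assembler, scaler])
scaled = pipeline.fit(sample).transform(sample)

kmeans = (KMeans()
          .setK(2)
          .setFeaturesCol("scaled_features")
          .setPredictionCol("prediction"))
clusters = kmeans.fit(scaled).transform(scaled)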