PySpark k-means with categorical variables
I am starting out with k-means clustering in pyspark (v1.6.2), using the following example, which includes mixed variable types:
# Import libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.mllib.clustering import KMeansModel

# Create sample DF
sample = sqlContext.createDataFrame(
    [["a@email.com", 12000, "M"],
     ["b@email.com", 43000, "M"],
     ["c@email.com", 5000, "F"],
     ["d@email.com", 60000, "M"]],
    ["email", "income", "gender"])
I used StringIndexer, OneHotEncoder, and VectorAssembler to handle the categorical attributes, as follows:
# Indexers encode strings as doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in sample.columns if x != 'income']
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in sample.columns if x != 'income']
# Assemble multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in sample.columns if x != 'income'] + ['income'],
    outputCol="features")
This piece ties the transformations together so they run in order:
pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(sample)
indexed = model.transform(sample)
indexed.show()
I know I can then run k-means on this transformed DF by doing:
kmeans = KMeans() \
    .setK(2) \
    .setFeaturesCol("features") \
    .setPredictionCol("prediction")
kmeans_transformer = kmeans.fit(indexed)
oo = kmeans_transformer.transform(indexed)
oo.select('email', 'income', 'gender', 'features',
          'prediction').show(truncate=False)
+-----------+------+------+-------------------------+----------+
|email |income|gender|features |prediction|
+-----------+------+------+-------------------------+----------+
|a@email.com|12000 |M |[0.0,1.0,0.0,1.0,12000.0]|1 |
|b@email.com|43000 |M |(5,[3,4],[1.0,43000.0]) |0 |
|c@email.com|5000 |F |(5,[0,4],[1.0,5000.0]) |1 |
|d@email.com|60000 |M |[0.0,0.0,1.0,1.0,60000.0]|0 |
+-----------+------+------+-------------------------+----------+
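For reference, the features column above simply concatenates the one-hot slots with the raw income. Reading the table, slots 0-2 appear to be the one-hot email index (last category dropped), slot 3 the gender (1.0 = "M"), and slot 4 the income; that exact layout is not guaranteed by the pipeline, since StringIndexer orders labels by frequency. Under that assumption, mapping a vector (or, later, a cluster center) back to readable values can be sketched with a hypothetical `decode` helper:

```python
# Assumed slot layout, inferred from the output table (not guaranteed by the
# pipeline, since StringIndexer orders labels by frequency):
#   slots 0-2: one-hot email index (last of the 4 categories dropped)
#   slot 3:    one-hot gender ("M" -> 1.0; the dropped category reads as "F")
#   slot 4:    raw income

def decode(vec, gender_labels=("F", "M")):
    """Map a feature vector (or cluster center) back to readable fields.

    For a cluster center the gender slot is the fraction of "M" rows in the
    cluster, so thresholding at 0.5 recovers the majority label.
    """
    gender = gender_labels[1] if vec[3] >= 0.5 else gender_labels[0]
    return {"gender": gender, "income": vec[4]}

print(decode([0.0, 1.0, 0.0, 1.0, 12000.0]))  # row a -> {'gender': 'M', 'income': 12000.0}
print(decode([1.0, 0.0, 0.0, 0.0, 5000.0]))   # row c -> {'gender': 'F', 'income': 5000.0}
```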
But I would like to know:

1) How to do the same with pyspark.mllib.clustering.KMeansModel, so that I can determine the optimal (lowest-cost) value of K (in line with the KMeans.train and computeCost functions there)
2) How to get the cluster centers back on the original scale (i.e. a "male"/"female" label rather than the encoded values)
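Regarding 1), the cost that KMeansModel.computeCost reports in pyspark.mllib is the within-set sum of squared errors (WSSSE): the sum of squared distances from each point to its nearest center. What that number measures can be sketched in plain Python over the four feature vectors from the table above, using the cluster assignment shown in the prediction column; `mean_center` and `wssse` are illustrative helpers, not pyspark APIs:

```python
# The four feature vectors from the table, with predictions 1, 0, 1, 0,
# i.e. clusters {a, c} and {b, d}.
points = [
    [0.0, 1.0, 0.0, 1.0, 12000.0],  # a -> cluster 1
    [0.0, 0.0, 0.0, 1.0, 43000.0],  # b -> cluster 0
    [1.0, 0.0, 0.0, 0.0, 5000.0],   # c -> cluster 1
    [0.0, 0.0, 1.0, 1.0, 60000.0],  # d -> cluster 0
]

def mean_center(cluster):
    """Component-wise mean of a list of vectors (a cluster centroid)."""
    return [sum(dims) / len(cluster) for dims in zip(*cluster)]

def wssse(points, centers):
    """Within-set sum of squared errors: what computeCost reports."""
    return sum(min(sum((p_i - c_i) ** 2 for p_i, c_i in zip(p, c))
                   for c in centers)
               for p in points)

cost_k1 = wssse(points, [mean_center(points)])
cost_k2 = wssse(points, [mean_center([points[1], points[3]]),
                         mean_center([points[0], points[2]])])
print(cost_k1, cost_k2)  # the cost always drops as k grows; look for the "elbow"
```

With the real RDD-based API you would compute the same kind of curve with `KMeans.train(rdd, k).computeCost(rdd)` for a range of k values and pick the k where the curve flattens out.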
The only attribute that really matters to your clustering here is income; on this scale, all of the other (0/1) dimensions are negligible. Make sure you standardize the input, since k-means is a distance-based algorithm. Use, for example, a StandardScaler.
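The scale problem is visible directly in the vectors above. A minimal plain-Python sketch of the effect (in the actual pipeline you would instead insert a pyspark.ml.feature.StandardScaler stage on the features column before the k-means):

```python
import statistics

incomes = [12000.0, 43000.0, 5000.0, 60000.0]

# Raw income differences dwarf the 0/1 one-hot dimensions, so Euclidean
# distance (and hence k-means) is decided by income alone.
raw_gap = (43000.0 - 12000.0) ** 2   # income contribution between rows a and b
onehot_gap = (1.0 - 0.0) ** 2        # largest possible one-hot contribution

mean = statistics.mean(incomes)
stdev = statistics.stdev(incomes)    # sample standard deviation
scaled = [(x - mean) / stdev for x in incomes]

print(raw_gap / onehot_gap)          # ~1e9: income utterly dominates the distance
print(max(scaled) - min(scaled))     # after z-scoring, income spans only a few units
```

After standardization the income dimension is on a comparable footing with the encoded categorical slots, so all features can influence the cluster assignment.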