Python: Programmatically creating feature vectors in Spark ML / pyspark
I would like to know whether there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark when the features are spread across multiple numeric columns, i.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
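I would like to use KMeans without recreating the dataset, that is, without manually adding the feature vector as a new column and repeating the hardcoded original column names in the code.
The solution I would like to improve is: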
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
    id = r.id,
    a1 = r.a1,
    a2 = r.a2,
    a3 = r.a3,
    a4 = r.a4,
    label = r.label,
    binomial_label=r.binomial_label,
    features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
I am looking for a solution that is something like:
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>
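You can use an assembler stage, such as VectorAssembler from pyspark.ml.feature, to build the features column from the existing columns.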
It can be combined with k-means using an ML Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)