Apache Spark: setting up X1…Xn and Y for linear regression with a Spark Dataset


I'm new to Spark and have just displayed my data with
DataSet.show()

As far as I remember from university a few years ago, every column except
W_12_26
holds my xIn values, and
W_12_26
itself is my yIn.

Looking at the output, I'm a bit confused about how to designate my xIn and yIn columns so I can build the model and make predictions.


Anything that puts me on the right track would be much appreciated.

I suggest you use a Pipeline. This gives you a lot of flexibility. See the following example:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Test data
df = sqlContext.createDataFrame([(1,2,3,4,5,6,7,8,9,10), (2,3,4,5,6,7,8,9,10,11)],
                                schema=["a","b","c","d","e","f","g","h","i","label"])
# assemble all the input features into a single vector
vecAssembler = VectorAssembler(inputCols=[x for x in df.columns if x != "label"],
                               outputCol="features", handleInvalid="skip")
# scale all the inputs to a given range - optional
normalizer = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1)
# define the classifier as needed; it consumes the scaled features
classifier = LogisticRegression(featuresCol="scaledFeatures", labelCol="label",
                                predictionCol="prediction", maxIter=40)
# create a pipeline with the needed stages
pipeline_test = Pipeline(stages=[vecAssembler, normalizer, classifier])

# train the model - this can be saved and loaded
pipeline_trained = pipeline_test.fit(df)  # split the data here and use the train set
# prediction - this gives a DataFrame with the results; can be used for evaluation
results = pipeline_trained.transform(df)  # split the data here and use the test set
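One subtle point when filtering the label out of `df.columns`: compare with `x != "label"`, not `x not in "label"`. The latter is a substring test, so any column whose one-letter name happens to appear inside the string `"label"` (here `a`, `b`, and `e`) is silently dropped from the features. A quick pure-Python check shows the difference:

```python
cols = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "label"]

# Substring membership: also drops "a", "b", and "e", not just "label"
buggy = [x for x in cols if x not in "label"]
# Plain inequality: drops only the label column
fixed = [x for x in cols if x != "label"]

print(buggy)  # ['c', 'd', 'f', 'g', 'h', 'i']
print(fixed)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```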