Apache Spark: setting up X1…Xn and Y for linear regression with a Spark Dataset


I'm new to Spark and have just displayed my data with
DataSet.show()

As far as I remember from university a few years ago, every column except
W_12_26
holds my xIn values, and
W_12_26
itself is my yIn.

Looking at the output, I'm a bit confused about how to designate my xIn and yIn columns so I can build the model and make predictions.


Anything that puts me on the right track would be much appreciated.

I suggest you use a Pipeline. This gives you a lot of flexibility. See the following example:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Test data
df = sqlContext.createDataFrame([(1,2,3,4,5,6,7,8,9,10), (2,3,4,5,6,7,8,9,10,11)],
                                schema=["a","b","c","d","e","f","g","h","i","label"])
# assemble all the input features into a single vector
vecAssembler = VectorAssembler(inputCols=[x for x in df.columns if x != "label"],
                               outputCol="features", handleInvalid="skip")
# scale all the inputs to a given range - optional
normalizer = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1)
# define the classifier as needed; it consumes the scaled features
classifier = LogisticRegression(featuresCol="scaledFeatures", labelCol="label",
                                predictionCol="prediction", maxIter=40)
# create a pipeline with the needed stages
pipeline_test = Pipeline(stages=[vecAssembler, normalizer, classifier])

# train the model - this can be saved and loaded
pipeline_trained = pipeline_test.fit(df)  # split the data here and use the train set
# prediction - this gives a DataFrame with the results; can be used for evaluation
results = pipeline_trained.transform(df)  # split the data here and use the test set
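One subtle point when filtering the label out of `df.columns`: compare with `x != "label"`, not `x not in "label"`. The latter is a substring test, so any column whose one-letter name happens to appear inside the string `"label"` (here `a`, `b`, and `e`) is silently dropped from the features. A quick pure-Python check shows the difference:

```python
cols = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "label"]

# Substring membership: also drops "a", "b", and "e", not just "label"
buggy = [x for x in cols if x not in "label"]
# Plain inequality: drops only the label column
fixed = [x for x in cols if x != "label"]

print(buggy)  # ['c', 'd', 'f', 'g', 'h', 'i']
print(fixed)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```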