
Scala Spark: how to train a distributed sparse regression model?


I am trying to build a regression model where the underlying feature matrix is very large (418K rows, 73K columns) and very sparse (58M non-zero values, roughly 0.2% of the whole matrix).

I represent the matrix in coordinate form as a DataFrame, where the first column is the row coordinate i, the second column is the column coordinate j, and the third column is the value at position {i,j}.

For example, the following matrix:

+-+-+-+
|0|1|0|
|2|0|0|
|0|0|3|
+-+-+-+
is represented as

+-+-+-----+
|i|j|value|
+-+-+-----+
|0|1| 1   |
|1|0| 2   |
|2|2| 3   |
+-+-+-----+
I have a separate DataFrame containing the label for each row coordinate i.


If possible, I would prefer to use the newer ml library rather than the older mllib.

Below is a small code example showing how distributed sparse linear regression can be implemented in spark.ml. I ran it with the matrix described above on a large cluster (Databricks Runtime 6.5 ML, which includes Apache Spark 2.4.5 and Scala 2.11); it scales well and executes in just a few minutes.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.feature.LabeledPoint
import spark.implicits._
import org.apache.spark.ml.regression.LinearRegression

// Construct Matrix coordinate representation DataFrame
val df = Seq(
  (0, 1, 14.0), 
  (0, 0, 13.0), 
  (1, 1, 11.0)
).toDF("i", "j", "value")

df.show()

+---+---+-----+
|  i|  j|value|
+---+---+-----+
|  0|  1| 14.0|
|  0|  0| 13.0|
|  1|  1| 11.0|
+---+---+-----+

// Construct label DataFrame
val df_label = Seq(
  (0, 41.1), 
  (1, 21.9) // beta_1 = 1, beta_2 = 2
).toDF("i", "label")

df_label.show()

+---+-----+
|  i|label|
+---+-----+
|  0| 41.1|
|  1| 21.9|
+---+-----+
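
(For reference, these labels are consistent with coefficients near beta = (1, 2): row 0 gives 13·1 + 14·2 = 41 ≈ 41.1 and row 1 gives 11·2 = 22 ≈ 21.9, so the fitted coefficients below should land close to [1, 2].)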

// UDF that sorts each row's (j, value) pairs by the column index j
val sortUdf: UserDefinedFunction = udf((rows: Seq[Row]) => {
  rows.map { case Row(j: Int, value: Double) => (j, value) }
    .sortBy { case (j, value) => j }
})
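
As an aside (my own sketch, not part of the original answer): the UDF can likely be avoided, since Spark's built-in sort_array orders an array of structs by their fields in declaration order, i.e. by j first. Assuming Spark 2.4+:

// Alternative without a UDF: sort_array sorts the (j, value) structs by j.
// Note the struct fields keep their original names here ("j"/"value" rather
// than the "_1"/"_2" produced by sortUdf).
val df_collected_builtin = df
  .groupBy("i")
  .agg(sort_array(collect_list(struct("j", "value"))) as "j_value_list")
  .withColumn("j_list", $"j_value_list".getField("j"))
  .withColumn("value_list", $"j_value_list".getField("value"))
  .drop("j_value_list")
  .join(df_label, "i")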

// collect j and value columns to lists, make sure they are sorted by j
// then join with labels
val df_collected_with_labels = df
.groupBy("i")
.agg(collect_list(struct("j", "value")) as "j_value")
.select($"i", sortUdf(col("j_value")).alias("j_value_list"))
.withColumn("j_list", $"j_value_list".getField("_1"))
.withColumn("value_list", $"j_value_list".getField("_2"))
.drop("j_value_list")
.join(df_label, "i")

df_collected_with_labels.show()
+---+------+------------+-----+
|  i|j_list|  value_list|label|
+---+------+------------+-----+
|  1|   [1]|      [11.0]| 21.9|
|  0|[0, 1]|[13.0, 14.0]| 41.1|
+---+------+------------+-----+
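
One detail worth calling out: the sort by j is not cosmetic. As far as I know, SparseVector requires its indices to be in strictly increasing order, so constructing it from unsorted lists would produce an invalid vector.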

// Feature-space size: SparseVector's size must exceed the largest column
// index, so use max(j) + 1 (a distinct count of j would undercount whenever
// some column index never appears in the data)
val unique_j = df.agg(max($"j")).head().getInt(0) + 1

val sparse_df = df_collected_with_labels
.map(r => LabeledPoint(r.getDouble(3), 
                       new SparseVector(size = unique_j, 
                                        indices = r.getAs[Seq[Int]]("j_list").toArray, 
                                        values = r.getAs[Seq[Double]]("value_list").toArray)))

sparse_df.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
| 21.9|      (2,[1],[11.0])|
| 41.1|(2,[0,1],[13.0,14...|
+-----+--------------------+

// Fit sparse regression!
val lr = new LinearRegression()
.setFitIntercept(false)

val lrModel = lr.fit(sparse_df)

lrModel.coefficients
org.apache.spark.ml.linalg.Vector = [1.0174825174825193,1.9909090909090894]

lrModel.predict(new SparseVector(size = unique_j, indices = Array(0), values = Array(4.0)))
Double = 4.069930069930077
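
To sanity-check the fit, the training summary can also be inspected (a small follow-up sketch; rootMeanSquaredError and r2 are members of LinearRegressionTrainingSummary):

// Basic diagnostics of the fitted model
val summary = lrModel.summary
println(s"RMSE: ${summary.rootMeanSquaredError}")
println(s"r2:   ${summary.r2}")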