How can I aggregate a Spark dataframe to get a sparse vector using Scala?

Tags: scala, apache-spark, spark-dataframe

I have a dataframe in Spark like the one below, and I want to group it by the id column. For each group I then need to create a sparse vector containing the elements of the weight column at the positions given by the index column. The length of the sparse vector is known, 1000 in this example.

The dataframe df:

+-----+------+-----+
|   id|weight|index|
+-----+------+-----+
|11830|     1|    8|
|11113|     1|    3|
| 1081|     1|    3|
| 2654|     1|    3|
|10633|     1|    3|
|11830|     1|   28|
|11351|     1|   12|
| 2737|     1|   26|
|11113|     3|    2|
| 6590|     1|    2|
+-----+------+-----+
I have read about doing something similar, but only for RDDs. Does anyone know a good way to do this for dataframes in Spark using Scala?

My attempt so far is to first collect the weights and indices as lists, like this:

import org.apache.spark.sql.functions.collect_list

val dfWithLists = df
    .groupBy("id")
    .agg(collect_list("weight") as "weights", collect_list("index") as "indices")
Which looks like this:

+-----+---------+----------+
|   id|  weights|   indices|
+-----+---------+----------+
|11830|   [1, 1]|   [8, 28]|
|11113|   [1, 3]|    [3, 2]|
| 1081|      [1]|       [3]|
| 2654|      [1]|       [3]|
|10633|      [1]|       [3]|
|11351|      [1]|      [12]|
| 2737|      [1]|      [26]|
| 6590|      [1]|       [2]|
+-----+---------+----------+
Then I define a UDF and do something like this:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

def toSparseVector: ((Array[Int], Array[BigInt]) => Vector) = {(a1, a2) => Vectors.sparse(1000, a1, a2.map(x => x.toDouble))}
val udfToSparseVector = udf(toSparseVector)

val dfWithSparseVector = dfWithLists.withColumn("SparseVector", udfToSparseVector($"indices", $"weights"))
But this doesn't seem to work, and it feels like there should be an easier way to do it that doesn't require collecting the weights and indices into lists first.


I'm new to Spark, dataframes and Scala, so any help is much appreciated.

You do have to collect them, since the vector has to be local, on a single machine.

To create a sparse vector you have two options: using unordered (index, value) pairs, or specifying the index and value arrays.
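For illustration, a minimal sketch of both constructors, using the same org.apache.spark.mllib.linalg API as the rest of this question (the indices and values here are made up):

import org.apache.spark.mllib.linalg.Vectors

// Option 1: a sequence of unordered (index, value) pairs;
// Vectors.sparse sorts them by index internally.
val v1 = Vectors.sparse(1000, Seq((28, 1.0), (8, 1.0)))

// Option 2: parallel arrays of indices and values; here the
// indices are expected to be in strictly increasing order.
val v2 = Vectors.sparse(1000, Array(8, 28), Array(1.0, 1.0))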

If you can get the data into a different (pivoted) format, you could also make use of the VectorAssembler.
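A rough sketch of that idea (not the original answer's code), assuming missing weights can be filled with 0 after pivoting; note that the assembled vector's length is the number of distinct index values, not a fixed 1000:

import org.apache.spark.ml.feature.VectorAssembler

// Pivot the index values into columns of weights, one row per id,
// and replace missing weights with 0.
val pivoted = df.groupBy("id").pivot("index").sum("weight").na.fill(0)
val featureCols = pivoted.columns.filterNot(_ == "id")

// Assemble the pivoted columns into a single (possibly sparse) vector column.
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val assembled = assembler.transform(pivoted)
assembled.show(10, false)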

With some small tweaks you can get your approach working:

:paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val df = Seq((11830,1,8), (11113, 1, 3), (1081, 1,3), (2654, 1, 3), (10633, 1, 3), (11830, 1, 28), (11351, 1, 12), (2737, 1, 26), (11113, 3, 2), (6590, 1, 2)).toDF("id", "weight", "index")

val dfWithFeat = df
  .rdd
  .map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))  // key by id, value is an (index, weight) pair
  .groupByKey()                                                   // gather all (index, weight) pairs per id
  .map(r => LabeledPoint(r._1, Vectors.sparse(1000, r._2.toSeq))) // id becomes the label, the pairs a sparse vector of length 1000
  .toDS

dfWithFeat.printSchema
dfWithFeat.show(10, false)


// Exiting paste mode, now interpreting.

root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)

+-------+-----------------------+
|label  |features               |
+-------+-----------------------+
|11113.0|(1000,[2,3],[3.0,1.0]) |
|2737.0 |(1000,[26],[1.0])      |
|10633.0|(1000,[3],[1.0])       |
|1081.0 |(1000,[3],[1.0])       |
|6590.0 |(1000,[2],[1.0])       |
|11830.0|(1000,[8,28],[1.0,1.0])|
|2654.0 |(1000,[3],[1.0])       |
|11351.0|(1000,[12],[1.0])      |
+-------+-----------------------+

dfWithFeat: org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint] = [label: double, features: vector]

Thanks a lot! This works when the indices are in strictly increasing order. Is there a way to do it when the index vector is not sorted? I get this error: java.lang.IllegalArgumentException: requirement failed: Index 324 follows 660 and is not strictly increasing.

It now uses a sequence of unsorted (index, weight) pairs to create the vector, so their order should no longer matter.
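To illustrate the difference the comments are talking about, a minimal sketch (made-up indices):

import org.apache.spark.mllib.linalg.Vectors

// The array-based constructor expects strictly increasing indices, which is
// the requirement behind the IllegalArgumentException mentioned above:
// Vectors.sparse(1000, Array(660, 324), Array(1.0, 1.0))

// The pair-based constructor accepts unsorted (index, value) pairs and sorts
// them itself, which is what the updated answer relies on:
val v = Vectors.sparse(1000, Seq((660, 1.0), (324, 1.0)))
// v: (1000,[324,660],[1.0,1.0])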