Creating an array of random features in a Spark DataFrame

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. Each of these DataFrames contains a column holding an array.

I would like to generate some random data and union it into the userFactors DataFrame.

Here is my code:

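import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark is in scope

val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = (new ALS()
  .setImplicitPrefs(true)
  .fit(df1))

val iF = model1.itemFactors
val uF = model1.userFactors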
I then create a random DataFrame using VectorAssembler and the following function:

def makeNew(df: DataFrame, rank: Int): DataFrame = {
    var df_dummy = df
    var i: Int = 0
    var inputCols: Array[String] = Array()
    for (i <- 0 to rank) {
       df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
       inputCols = inputCols :+ "feature".concat(i.toString)
      }
    val assembler = new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("userFeatures")
    val output = assembler.transform(df_dummy)
    output.select("user", "userFeatures")
  }
The problem arises when I union the two DataFrames:

usersFactorsNew.union(uF)
which produces the error:

 org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;

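Perhaps VectorAssembler is not the best choice for this task. However, it is the only option I have found so far. I would welcome any better suggestions.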

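Instead of creating a dummy DataFrame and using VectorAssembler to generate the random feature vectors, you can use a UDF directly. The userFactors of an ALS model hold an Array[Float], so the output of the UDF should match that type.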

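import scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}
// Returns rank random floats, matching the Array[Float] type of userFactors
val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat())
})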
Note that this gives numbers in the interval [0.0, 1.0) (the same as the rand() used in the question). If other numbers are needed, modify it accordingly.
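If values outside [0.0, 1.0) are wanted, the generator can be rescaled. The following is a minimal sketch in plain Scala, assuming a uniform distribution is acceptable. The RandomFactors object and its uniform method are hypothetical names introduced for illustration; they are not part of Spark or of this answer.

```scala
import scala.util.Random

object RandomFactors {
  // Hypothetical helper: returns `rank` floats drawn uniformly from
  // [low, high), up to floating-point rounding at the upper end.
  def uniform(rank: Int, low: Float, high: Float, rng: Random = new Random()): Array[Float] = {
    require(rank >= 0, "rank must be non-negative")
    require(low < high, "low must be smaller than high")
    // nextFloat() yields a value in [0.0, 1.0); scale and shift it.
    Array.fill(rank)(low + (high - low) * rng.nextFloat())
  }
}
```

Inside the UDF, the body could then be replaced by a call such as RandomFactors.uniform(rank, -0.5f, 0.5f), which still returns an Array[Float].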

Using a rank of 3 and the usersDf DataFrame:

val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))
将给出如下数据帧(当然是随机特征值)

现在可以将此数据帧与
uF
数据帧连接起来


UDF
不起作用的原因应该是因为它是一个
数组[Double],而您需要一个
union
数组[Float]
。应该可以使用
map(u.toFloat)`进行修复


您的所有过程都是正确的。甚至
udf
功能也能成功工作。您只需将
makeNew
函数的最后一部分更改为

def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select(col("id"), toArrUdf(col("userFeatures")).as("features"))
}
你应该得到

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|567|[0.8259185719733708, 0.327713892339658, 0.049547223031371046, 0.056661808506210054, 0.5846626163454274, 0.038497936270104005, 0.8970865088803417, 0.8840660648882804, 0.837866669938156, 0.9395263094918058, 0.09179528484355126, 0.4915430644129799, 0.11083447052043116, 0.5122858182953718, 0.4302683812966408, 0.3862741815833828, 0.6189322403095068, 0.3000371006293433, 0.09331299668168902, 0.7421838728601371, 0.855867963988993]|
|678|[0.7686514248005568, 0.5473580740023187, 0.072945344124282, 0.36648594574355287, 0.9780202082328863, 0.5289221651923784, 0.3719451099963028, 0.2824660794505932, 0.4873197501260199, 0.9364676464120849, 0.011539929543513794, 0.5240615794930654, 0.6282546154521298, 0.995256022569878, 0.6659179561266975, 0.8990775317754092, 0.08650071017556926, 0.5190186149992805, 0.056345335742325475, 0.6465357505620791, 0.17913532817943245] |
|123|[0.04177388548851013, 0.26762014627456665, -0.19617630541324615, 0.34298020601272583, 0.19632814824581146, -0.2748605012893677, 0.07724890112876892, 0.4277132749557495, 0.1927199512720108, -0.40271613001823425]                                                                                                                                                                                                                        |
|234|[0.04139673709869385, 0.26520395278930664, -0.19440513849258423, 0.3398836553096771, 0.1945556253194809, -0.27237895131111145, 0.07655145972967148, 0.42385169863700867, 0.19098000228405, -0.39908021688461304]                                                                                                                                                                                                                          |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+----+----------------------------------------------------------+
|user|userFeatures                                              |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+
val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select(col("id"), toArrUdf(col("userFeatures")).as("features"))
}
val usersDf: DataFrame = Seq((567), (678)).toDF("id")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
usersFactorsNew.union(uF).show(false)
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|567|[0.8259185719733708, 0.327713892339658, 0.049547223031371046, 0.056661808506210054, 0.5846626163454274, 0.038497936270104005, 0.8970865088803417, 0.8840660648882804, 0.837866669938156, 0.9395263094918058, 0.09179528484355126, 0.4915430644129799, 0.11083447052043116, 0.5122858182953718, 0.4302683812966408, 0.3862741815833828, 0.6189322403095068, 0.3000371006293433, 0.09331299668168902, 0.7421838728601371, 0.855867963988993]|
|678|[0.7686514248005568, 0.5473580740023187, 0.072945344124282, 0.36648594574355287, 0.9780202082328863, 0.5289221651923784, 0.3719451099963028, 0.2824660794505932, 0.4873197501260199, 0.9364676464120849, 0.011539929543513794, 0.5240615794930654, 0.6282546154521298, 0.995256022569878, 0.6659179561266975, 0.8990775317754092, 0.08650071017556926, 0.5190186149992805, 0.056345335742325475, 0.6465357505620791, 0.17913532817943245] |
|123|[0.04177388548851013, 0.26762014627456665, -0.19617630541324615, 0.34298020601272583, 0.19632814824581146, -0.2748605012893677, 0.07724890112876892, 0.4277132749557495, 0.1927199512720108, -0.40271613001823425]                                                                                                                                                                                                                        |
|234|[0.04139673709869385, 0.26520395278930664, -0.19440513849258423, 0.3398836553096771, 0.1945556253194809, -0.27237895131111145, 0.07655145972967148, 0.42385169863700867, 0.19098000228405, -0.39908021688461304]                                                                                                                                                                                                                          |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
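
One detail worth noting in the output: the rows for ids 567 and 678 contain 21 values although makeNew was called with a rank of 20. The loop uses 0 to rank, which is inclusive at both ends and therefore creates rank + 1 feature columns. The union still succeeds because array<float> carries no length information. A minimal sketch of the difference in plain Scala follows; the FeatureNames object and its methods are hypothetical names used only to mirror the column-name loop in makeNew.

```scala
object FeatureNames {
  // `0 to rank` includes the upper bound, so it yields rank + 1 names.
  def inclusive(rank: Int): Seq[String] =
    (0 to rank).map(i => s"feature$i")

  // `0 until rank` excludes the upper bound, so it yields exactly rank names.
  def exclusive(rank: Int): Seq[String] =
    (0 until rank).map(i => s"feature$i")
}
```

If exactly rank random features are wanted, replacing 0 to rank with 0 until rank in makeNew should be enough.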