Apache Spark SQL: Spark DataFrame - how to divide a column's values by the column's maximum

Question: Can I divide a DataFrame column's values by the maximum of that column? In Spark SQL this can be done with a subquery:
%sql
SELECT cumulativeSum / (SELECT max(cumulativeSum) FROM singularValueDF)
FROM singularValueDF
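If you would rather stay in the DataFrame API than embed SQL text, the same divide-by-max can be written as a cross join against the one-row aggregate. This is a sketch only; `df`, `cumulativeSum`, and `maxCumulativeSum` are assumed names, not from the original post:

```scala
import org.apache.spark.sql.functions._

// Sketch: aggregate to a single-row DataFrame holding the max,
// cross-join it back (cheap, since it is one row), then divide.
val maxDF = df.agg(max("cumulativeSum").as("maxCumulativeSum"))
val coverageDF = df
  .crossJoin(maxDF)
  .withColumn("coverage", col("cumulativeSum") / col("maxCumulativeSum"))
```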
Background
I have a row of singular values from an SVD:
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val singluarValues = s.toDense.values
val singularValueRDD = sc.parallelize(singluarValues)
singularValueRDD.toDF("singluar_value").show(5)
+------------------+
| singluar_value|
+------------------+
| 323503.703778161|
|109669.14717327854|
|101621.48745300347|
| 93843.81264344015|
| 87209.07876311651|
...
I need the cumulative sum of the singular values, and the coverage of each:
coverage = cumulativeSum / max(cumulativeSum)
+------------------+-----------------+-------------------+
| singluar_value| cumulativeSum| coverage|
+------------------+-----------------+-------------------+
| 323503.703778161| 323503.703778161| 0.0613375619450355|
|109669.14717327854|433172.8509514396| 0.0821312592957559|
|101621.48745300347|534794.3384044431|0.10139908902629156|
| 93843.81264344015|628638.1510478833|0.11919224132702236|
| 87209.07876311651|715847.2298109998|0.13572742224869208|
...
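As a sanity check on the arithmetic, the same quantities in plain Scala are just a running sum divided by its final element (illustrative numbers, not the values above):

```scala
// Running sum and coverage for a toy sequence. Since the inputs are
// non-negative, max(cumulativeSum) is simply the last running total.
val values = Seq(3.0, 1.0, 1.0)
val cumulative = values.scanLeft(0.0)(_ + _).tail // Seq(3.0, 4.0, 5.0)
val coverage = cumulative.map(_ / cumulative.last) // Seq(0.6, 0.8, 1.0)
```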
Attempt
I tried to do this in one pass with the DataFrame API, but without success:
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val singluarValues = s.toDense.values
val windowSpec = Window
.orderBy(desc("singluar_value"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
.withColumn(
"cumulativeSum",
sum(col("singluar_value")).over(windowSpec)
)
.withColumn(
"coverage",
col("cumulativeSum") / max(col("cumulativeSum"))
)
which fails with the error:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`singluar_value`' is not an aggregate function. Wrap '((`cumulativeSum` / max(`cumulativeSum`)) AS `coverage`)' in windowing function(s) or wrap '`singluar_value`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [singluar_value#8430, cumulativeSum#8433, (cumulativeSum#8433 / max(cumulativeSum#8433)) AS coverage#8437]
+- Project [singluar_value#8430, cumulativeSum#8433]
+- Project [singluar_value#8430, cumulativeSum#8433, cumulativeSum#8433]
+- Window [sum(singluar_value#8430) windowspecdefinition(singluar_value#8430 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS cumulativeSum#8433], [singluar_value#8430 DESC NULLS LAST]
+- Project [singluar_value#8430]
+- Project [value#8428 AS singluar_value#8430]
+- SerializeFromObject [input[0, double, false] AS value#8428]
+- ExternalRDD [obj#8427]
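The message means `max(cumulativeSum)` is being treated as an ordinary (ungrouped) aggregate, which would force every other column into a groupBy. One way to keep the whole computation in a single pipeline is to evaluate the max over a fully unbounded window instead. This is a hedged sketch: it assumes `singularValueDF` already carries the `cumulativeSum` column, as built in the workaround below:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// An all-rows frame, so max() sees the entire column and becomes a
// window function rather than an ungrouped aggregate.
val windowSpecAll = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val coverageDF = singularValueDF
  .withColumn(
    "coverage",
    col("cumulativeSum") / max(col("cumulativeSum")).over(windowSpecAll)
  )
```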
Workaround
First collect the max, then feed it back in with the literal (lit) function - it works, but it is cumbersome:
val windowSpec = Window
.orderBy(desc("singluar_value"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val singularValueRDD = sc.parallelize(singluarValues)
val singularValueDF = singularValueRDD.toDF("singluar_value")
.withColumn(
"cumulativeSum",
sum(col("singluar_value")).over(windowSpec)
)
val total = singularValueDF.select(max(col("cumulativeSum"))).collect()(0).getDouble(0)
val coverageDF = singularValueDF
.withColumn(
"coverage",
col("cumulativeSum") / lit(total)
)
coverageDF.show(5)
A variant that stays in one pipeline replaces the collected total with last(cumulativeSum) over a fully unbounded window (this version also converts each singular value to an eigenvalue via s² / (numSamples - 1)):
val windowSpec = Window
.orderBy(desc("singluar_value"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpecAll = Window
.orderBy(desc("singluar_value"))
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
.withColumn(
"id",
row_number().over(windowSpec)
)
.select("id", "singluar_value")
.withColumn(
"eignvalue",
pow(col("singluar_value"), 2) / lit(numSamples -1)
)
.withColumn(
"cumulativeSum",
sum(col("eignvalue")).over(windowSpec)
)
.withColumn(
"coverage",
col("cumulativeSum") / last(col("cumulativeSum")).over(windowSpecAll)
)
coverageDF.show(5)
+---+------------------+------------------+------------------+-------------------+
| id| singluar_value| eignvalue| cumulativeSum| coverage|
+---+------------------+------------------+------------------+-------------------+
| 1| 323503.703778161| 2491836.623685996| 2491836.623685996|0.43500390900934977|
| 2|109669.14717327854| 286371.6241271037| 2778208.2478131|0.48499626193511053|
| 3|101621.48745300347|245885.06183863763|3024093.3096517376| 0.5279208108602296|
| 4| 93843.81264344015|209687.40140139285| 3233780.71105313| 0.5645262762477199|
| 5| 87209.07876311651|181085.82153649992| 3414866.53258963| 0.5961387180449769|
+---+------------------+------------------+------------------+-------------------+