Condition on dataframe row contents in Spark Scala


I have the following dataframe:

+--------+---------+------+
|  value1| value2  |value3|
+--------+---------+------+
|   a    |  2      |   3  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   c    |  3      |   4  |
+--------+---------+------+
When value1 = b, I want to take the result of value2/value3 for that row, put it in a new field called "result", and add it to all rows. That means another column has to be added to the dataframe. For example, the result of 5/4 (I chose it because it corresponds to b) should be added to every row of the dataframe. I know I should use code along these lines:

 val dataframe_new = Dataframe.withColumn("result", $"value2" / $"value3")
 dataframe_new.show()
But how do I express the condition so that the value is added to all rows? The output should look like this:

+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|  1.25|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  1.25|
+---+---+---+------+

Can you help me? Thanks in advance.

You can simply use when:

scala> val df = Seq(("a",2,3),("b",5,4),("b",5,4),("c",3,4)).toDF("v1","v2","v3")
df: org.apache.spark.sql.DataFrame = [v1: string, v2: int ... 1 more field]

scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|  null|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  null|
+---+---+---+------+
You can also nest multiple when clauses inside otherwise, like this:

scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3")).
     |    otherwise(when($"v1" === "a", $"v3"/$"v2"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|   1.5|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  null|
+---+---+---+------+
EDIT: It seems you need something else: the condition on v1 always comes with the same values of v2 and v3, which allows us to do the following:

With Spark 2+:

scala> val res = df.filter($"v1" === lit("b")).distinct.select($"v2"/$"v3").as[Double].head
res: Double = 1.25

With Spark < 2:

scala> val res = df.filter($"v1" === lit("b")).distinct.withColumn("result",$"v2"/$"v3").rdd.map(_.getAs[Double]("result")).collect()(0)
res: Double = 1.25

scala> df.withColumn("v4", lit(res)).show
+---+---+---+----+
| v1| v2| v3|  v4|
+---+---+---+----+
|  a|  2|  3|1.25|
|  b|  5|  4|1.25|
|  b|  5|  4|1.25|
|  c|  3|  4|1.25|
+---+---+---+----+

The answer is almost the same as eliasah's, but with a different flavour. I'm writing it up so that others can benefit from this approach as well.

import sqlContext.implicits._

val df = Seq(
  ("a", 2, 3),
  ("b", 5, 4),
  ("b", 5, 4),
  ("c", 3, 4)
).toDF("value1", "value2", "value3")
This should give you:

+------+------+------+
|value1|value2|value3|
+------+------+------+
|a     |2     |3     |
|b     |5     |4     |
|b     |5     |4     |
|c     |3     |4     |
+------+------+------+
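
The value for the rows where value1 is "b" can then be computed once and attached to every row with lit; a minimal sketch (the variable name valueToBeAdded and the use of first() are illustrative):

import org.apache.spark.sql.functions._

// take value2/value3 from the first row where value1 == "b"
val valueToBeAdded = df.filter($"value1" === "b")
  .select($"value2" / $"value3")
  .first()
  .getDouble(0)

// broadcast that single value to every row as a literal column
df.withColumn("result", lit(valueToBeAdded)).show(false)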

This should generate the output:

+------+------+------+------+
|value1|value2|value3|result|
+------+------+------+------+
|a     |2     |3     |1.25  |
|b     |5     |4     |1.25  |
|b     |5     |4     |1.25  |
|c     |3     |4     |1.25  |
+------+------+------+------+

@eliasah You can use .as[Double].head instead of .rdd.map().collect()
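
For example, with Spark 2+ the single value can be pulled out through the typed Dataset API instead of a round trip through the RDD; a minimal sketch against the value1/value2/value3 dataframe above (the variable name valueToBeAdded is illustrative):

// Spark 2+: extract the single Double with the Dataset API instead of the RDD
val valueToBeAdded = df.filter($"value1" === "b")
  .distinct
  .select($"value2" / $"value3")
  .as[Double]
  .head

df.withColumn("result", lit(valueToBeAdded)).show(false)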