Condition on dataframe row contents in Spark Scala
I have the following dataframe:
+------+------+------+
|value1|value2|value3|
+------+------+------+
|     a|     2|     3|
|     b|     5|     4|
|     b|     5|     4|
|     c|     3|     4|
+------+------+------+
When value1 = b, I want to compute value2/value3, put the result into a new field called "result", and add it to every row. In other words, another column must be added to the dataframe: for all rows, the result of 5/4 (which I chose because those are the values for b) should be added to the dataframe. I know I should use code along these lines:
val dataframe_new = Dataframe.withColumn("result", $"value2" / $"value3")
dataframe_new.show()
But how do I express the condition so that the same result is added to all rows? The output should look like this:
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
| a| 2| 3| 1.25|
| b| 5| 4| 1.25|
| b| 5| 4| 1.25|
| c| 3| 4| 1.25|
+---+---+---+------+
Can you help me? Thanks in advance.

You can simply use `when`:
scala> val df = Seq(("a",2,3),("b",5,4),("b",5,4),("c",3,4)).toDF("v1","v2","v3")
df: org.apache.spark.sql.DataFrame = [v1: string, v2: int ... 1 more field]
scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
| a| 2| 3| null|
| b| 5| 4| 1.25|
| b| 5| 4| 1.25|
| c| 3| 4| null|
+---+---+---+------+
You can nest multiple `when` clauses, like this:
scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3")).
| otherwise(when($"v1" === "a", $"v3"/$"v2"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
| a| 2| 3| 1.5|
| b| 5| 4| 1.25|
| b| 5| 4| 1.25|
| c| 3| 4| null|
+---+---+---+------+
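When there are several branches, chaining `when` calls reads more cleanly than nesting each inside `otherwise`; a sketch of the equivalent chained form (the `0.0` default for unmatched rows is my own assumption, not from the question):

```scala
// Column.when can be called repeatedly on the result of when(),
// with a final otherwise supplying a default instead of null.
df.withColumn("result",
  when($"v1" === "b", $"v2" / $"v3")      // 5/4  = 1.25
    .when($"v1" === "a", $"v3" / $"v2")   // 3/2  = 1.5
    .otherwise(lit(0.0)))                 // assumed default for unmatched rows
  .show
```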
EDIT: Since the rows matching your condition on v1 always carry the same values of v2 and v3, it seems you need something different, which allows us to do the following.

With Spark 2+:
scala> val res = df.filter($"v1" === lit("b")).distinct.select($"v2"/$"v3").as[Double].head
res: Double = 1.25
Otherwise (without the Dataset API):
scala> val res = df.filter($"v1" === lit("b")).distinct.withColumn("result",$"v2"/$"v3").rdd.map(_.getAs[Double]("result")).collect()(0)
res: Double = 1.25
scala> df.withColumn("v4", lit(res)).show
+---+---+---+----+
| v1| v2| v3| v4|
+---+---+---+----+
| a| 2| 3|1.25|
| b| 5| 4|1.25|
| b| 5| 4|1.25|
| c| 3| 4|1.25|
+---+---+---+----+
Prior to Spark 2, the answer is almost the same as eliasah's, just with a different flavor. I am writing it down so that others can benefit from this approach as well:
import sqlContext.implicits._

val df = Seq(
  ("a", 2, 3),
  ("b", 5, 4),
  ("b", 5, 4),
  ("c", 3, 4)
).toDF("value1", "value2", "value3")
which should give
+------+------+------+
|value1|value2|value3|
+------+------+------+
|a |2 |3 |
|b |5 |4 |
|b |5 |4 |
|c |3 |4 |
+------+------+------+
and the transformation should produce the output
+------+------+------+------+
|value1|value2|value3|result|
+------+------+------+------+
|a |2 |3 |1.25 |
|b |5 |4 |1.25 |
|b |5 |4 |1.25 |
|c |3 |4 |1.25 |
+------+------+------+------+
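The transformation itself is not shown above; a sketch that would fit this pre-Spark-2 answer (using `first()` on the driver instead of the Dataset API is an assumption on my part):

```scala
// Compute the ratio once on the driver: filter to value1 == "b",
// divide the columns, and pull back the single resulting value.
// Spark SQL's "/" on integer columns already yields a double.
val ratio = df.filter($"value1" === "b")
  .select($"value2" / $"value3")
  .first()
  .getDouble(0)            // 5.0 / 4 = 1.25

// Broadcast it to every row as a literal column.
df.withColumn("result", lit(ratio)).show(false)
```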
@eliasah, you can use `.as[Double].head` instead of `.rdd.map().collect()`.