How do I add a new column to a DataFrame that is not based on an existing column, using Scala/Spark?

I have a DataFrame and I want to add a new column that is not based on an existing column. How can I do this?

This is my DataFrame:

+----+
|time|
+----+
|   1|
|   4|
|   3|
|   2|
|   5|
|   7|
|   3|
|   5|
+----+

This is my expected result:

+----+-----+  
|time|index|  
+----+-----+  
|   1|    1|  
|   4|    2|  
|   3|    3|  
|   2|    4|  
|   5|    5|  
|   7|    6|  
|   3|    7|  
|   5|    8|  
+----+-----+

Using the RDD method zipWithIndex may be what you want:

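import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
// Pair each row with its 0-based position and append that position as a new field
val newRdd = yourDF.rdd.zipWithIndex.map { case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id) }
val schema = StructType(Array(StructField("time", IntegerType, nullable = true), StructField("index", LongType, nullable = true)))
val newDF = spark.createDataFrame(newRdd, schema)
newDF.show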
+----+-----+
|time|index|
+----+-----+
|   1|    0|
|   4|    1|
|   3|    2|
|   2|    3|
|   5|    4|
|   7|    5|
|   3|    6|
|   5|    7|
+----+-----+

I am assuming here that your time column is IntegerType.
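As a variation on the zipWithIndex approach, here is a minimal sketch that avoids hard-coding the column types by reusing the DataFrame's own schema, and adds 1 to the index so that it starts at 1 as in the expected result. The object name `ZipWithIndexSketch`, the helper `withIndex`, and the local SparkSession settings are assumptions made for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipWithIndexSketch {
  // Hypothetical helper: appends a consecutive, 1-based Long column to any DataFrame.
  def withIndex(df: DataFrame, colName: String = "index"): DataFrame = {
    // zipWithIndex numbers rows from 0 in partition order, so shift by one.
    val indexed = df.rdd.zipWithIndex.map { case (row, id) =>
      Row.fromSeq(row.toSeq :+ (id + 1L))
    }
    // Reuse the existing schema and append the new field to it.
    val schema = StructType(df.schema.fields :+ StructField(colName, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for demonstration only.
    val spark = SparkSession.builder().master("local[1]").appName("zipWithIndex-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 4, 3, 2, 5, 7, 3, 5).toDF("time")
    withIndex(df).show()

    spark.stop()
  }
}
```

Because the index follows the order in which rows appear in the underlying partitions, it only matches the "original" row order if the DataFrame has not been shuffled beforehand.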

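Rather than using a Window function, or converting to an rdd and using zipWithIndex, which are slower, you can use the built-in function monotonically_increasing_id: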

import org.apache.spark.sql.functions._
df.withColumn("index", monotonically_increasing_id())

Hope this helps.
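One caveat with monotonically_increasing_id is that the generated ids are guaranteed to be unique and increasing, but not consecutive: with more than one partition the values can jump by large amounts. If a gap-free 1, 2, 3, ... index is required, one possible approach is to rank the generated ids with row_number over a window. The following is a minimal sketch under that assumption; the object name `ConsecutiveIndexSketch`, the helper `withConsecutiveIndex`, and the temporary column `_mid` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

object ConsecutiveIndexSketch {
  // Hypothetical helper: converts the increasing ids into a consecutive 1-based index.
  def withConsecutiveIndex(df: DataFrame): DataFrame = {
    // Step 1: attach the unique, increasing (but possibly gapped) ids.
    val withId = df.withColumn("_mid", monotonically_increasing_id())
    // Step 2: rank the ids. A window without partitionBy pulls all rows
    // into a single partition, which can be slow on large data.
    val w = Window.orderBy("_mid")
    withId.withColumn("index", row_number().over(w)).drop("_mid")
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for demonstration only.
    val spark = SparkSession.builder().master("local[2]").appName("consecutive-index-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 4, 3, 2, 5, 7, 3, 5).toDF("time")
    withConsecutiveIndex(df).show()

    spark.stop()
  }
}
```

The single-partition window is the trade-off here: it gives consecutive numbers without leaving the DataFrame API, but it gives up the parallelism that makes monotonically_increasing_id cheap in the first place.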

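With your approach, I have to convert the DataFrame to an rdd and then convert the rdd back to a DataFrame, which is inefficient.

I am not sure whether this is the best solution, but there should be no serious performance problem.

Converting to an rdd does not cause serious performance problems. However, a DataFrame uses the Tungsten format, while an rdd does not.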