Scala: adding a new column to a DataFrame that counts the neighbors of another column's value


I have a DataFrame like this:

org.apache.spark.sql.DataFrame = [Timestamp: int, AccX: double ... 17 more fields]

The timestamps are not contiguous, and they are in epoch format.

I want to add a new column that, for each row, contains the number of timestamps that are close to the current row's timestamp.

For example:

TimeStamp

1
5
6
12
13
16
Suppose we take a range of 3. The output would be:

|      TimeStamp      |    New column    |
|          1          |         1        |
|          5          |         2        |
|          6          |         2        |
|          12         |         2        |
|          13         |         3        |
|          16         |         2        |
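
To spell the rule out: each timestamp counts itself, and "close" means an absolute difference of at most the range. As a plain-Scala sanity check of the table above (the names here are just illustrative):

val range = 3
val timestamps = Seq( 1, 5, 6, 12, 13, 16 )

// For each timestamp, count how many timestamps (itself included)
// differ from it by at most `range`.
val counts = timestamps.map( t => timestamps.count( x => math.abs( x - t ) <= range ) )
// counts: List(1, 2, 2, 2, 3, 2)
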
I wanted to do something like this:

MyDF.map{x => MyDF.filter(MyDF("Timestamp").gt(x.getAs[Int]("Timestamp") - range).lt(x.getAs[Int]("Timestamp") + range) ).count()}
But this leaves me with an:
org.apache.spark.sql.Dataset[Long] = [value: bigint]

which I don't know what to do with.

Does anyone have a better idea of how to approach this?

Thanks.

Update: I am using a Zeppelin notebook running Spark 2.1.1. After trying the solution proposed by @Dennis Tsoi, I get an error when I try to run an action (such as show or collect) on the resulting DataFrame.

Here is the full text of the error:

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2104)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:371)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withNewExecutionId(Dataset.scala:2788)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2128)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2127)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2818)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
  ... 88 elided
Caused by: java.io.NotSerializableException: org.apache.spark.sql.expressions.WindowSpec
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.expressions.WindowSpec, value: org.apache.spark.sql.expressions.WindowSpec@79df42d)
    - field (class: $iw, name: windowSpec, type: class org.apache.spark.sql.expressions.WindowSpec)
    - object (class $iw, $iw@20ade815)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@77cac38a)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ebfd642)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ee19937)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@67b1d8f0)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@16ca3d83)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3129d731)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@142a2936)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@494facc5)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@45e32c0a)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@509c3eb6)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7bba53a2)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@20971db8)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@ba81c26)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@9375cbb)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3226a593)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@201516a3)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ac15b76)
    - field (class: $line20176553781522.$read, name: $iw, type: class $iw)
    - object (class $line20176553781522.$read, $line20176553781522.$read@21cc8115)
    - field (class: $iw, name: $line20176553781522$read, type: class $line20176553781522.$read)
    - object (class $iw, $iw@57677eee)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1d619339)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@63f875)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2a8641fe)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@279b1062)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2a06eb02)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6071a045)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@36b8b963)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@49987884)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6cdfa5ad)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3bea2150)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7d1c7dc)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@78f47403)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6327d388)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5d120092)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@4da8dd9c)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2afee9a4)
    - field (class: $line20176553782370.$read, name: $iw, type: class $iw)
    - object (class $line20176553782370.$read, $line20176553782370.$read@7112605e)
    - field (class: $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, name: $line20176553782370$read, type: class $line20176553782370.$read)
    - object (class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw@cc82e3c)
    - field (class: $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, name: $outer, type: class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw)
    - object (class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw@9ec8a4e)
    - field (class: $$$$7f619eaa173efe86d354fc4efb19aab8$$$$$anonfun$1, name: $outer, type: class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw)
    - object (class $$$$7f619eaa173efe86d354fc4efb19aab8$$$$$anonfun$1, <function1>)
    - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, name: func$2, type: interface scala.Function1)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, <function1>)
    - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF(input[0, int, true]))
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
  ... 121 more
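
Reading the serialization stack, the failure does not come from the counting logic itself: the stack shows a field named windowSpec of type org.apache.spark.sql.expressions.WindowSpec sitting in one of the interpreter wrapper objects ($iw), and that wrapper chain is dragged into the UDF's closure when the task is serialized. In other words, a WindowSpec val defined in some earlier notebook cell is captured together with the UDF; WindowSpec is not serializable, so the action fails with "Task not serializable". A hypothetical minimal illustration of that kind of capture (the names are made up, not taken from the actual notebook):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.udf

// Earlier cell: a WindowSpec kept around in the notebook session.
// WindowSpec itself is not serializable.
val windowSpec = Window.orderBy( "TimeStamp" )

// A UDF defined later in the same session can end up referencing the
// wrapper object that holds windowSpec, and then an action on a
// DataFrame using this UDF may fail to serialize the task.
val someUdf = udf( ( ts: Int ) => ts + 1 )

Removing that val (or restarting the interpreter so it is no longer in scope) is usually enough to make the error go away.
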
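For reference, the suggested solution (the one referred to in the update above) collects the timestamps to the driver once, attaches them to every row as a literal array column, and counts the neighbors in a UDF:
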
import org.apache.spark.sql.functions.{ lit, udf }
import spark.implicits._   // assumes the usual Zeppelin / spark-shell SparkSession named `spark`

val timestampsDF = 
    Seq(
        ( 1, "smth1" ),
        ( 5, "smth2" ),
        ( 6, "smth3" ),
        ( 12, "smth4" ),
        ( 13, "smth5" ),
        ( 16, "smth6" )
    )
    .toDF( "TimeStamp", "smth" )

// Collect all timestamps to the driver once, so they can be attached
// to every row as a literal array column.
val timestampsStatic = 
    timestampsDF
    .select("TimeStamp")
    .as[ ( Int ) ]
    .collect()

// Count how many of the collected timestamps lie within +/- 3 of the
// current row's timestamp (the row itself included).
def countNeighbors = udf( ( currentTs: Int, timestamps: Seq[ Int ] ) => {

    timestamps.count( ( ts ) => Math.abs( currentTs - ts ) <= 3 )
} )

// Attach the full array of timestamps to every row.
val alltimeDF = 
    timestampsDF
    .withColumn( 
        "All TimeStamps", 
        lit( timestampsStatic )
    )

// Compute the neighbor count per row, then drop the helper column.
val neighborsDF =
    alltimeDF
    .withColumn( 
        "New Column", 
        countNeighbors( alltimeDF( "TimeStamp" ), alltimeDF( "All TimeStamps" ) )
    )
    .drop( "All TimeStamps" )

neighborsDF.show()
+---------+-----+----------+
|TimeStamp| smth|New Column|
+---------+-----+----------+
|        1|smth1|         1|
|        5|smth2|         2|
|        6|smth3|         2|
|       12|smth4|         2|
|       13|smth5|         3|
|       16|smth6|         2|
+---------+-----+----------+
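
The approach above ships the full collected list of timestamps into every row, which is fine for a small table but will not scale as the number of timestamps grows. An alternative that stays inside the DataFrame API is a window with a range frame over TimeStamp. A minimal sketch, assuming the neighbor rule is an absolute difference of at most 3 and reusing the timestampsDF defined above (note that a window with no partitionBy pulls all rows through a single partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ count, lit }

// Range frame over the TimeStamp values themselves: for each row, the
// frame contains every row whose TimeStamp is within +/- 3 of it.
val byTimestamp = Window.orderBy( "TimeStamp" ).rangeBetween( -3, 3 )

val neighborsViaWindow =
    timestampsDF.withColumn( "New Column", count( lit( 1 ) ).over( byTimestamp ) )

neighborsViaWindow.show()

Because this version uses only built-in expressions, no user-written closure has to be serialized, which also sidesteps the "Task not serializable" error shown above.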