Spark/Scala: forward fill with last observation (2)


Regarding this question:

I am trying to reproduce the problem and solve it.

I have created a file, mre.csv:

Date,B
2015-06-01,33
2015-06-02,
2015-06-03,
2015-06-04,
2015-06-05,22
2015-06-06,
2015-06-07,
Then I read the file:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("D:/playground/mre.csv")

df.show()

val rows: RDD[Row] = df.orderBy($"Date").rdd
val schema = df.schema
Then I solved the problem with the following code:

df = df.withColumn("id",lit(1))
var spec = Window.partitionBy("id").orderBy("Date")
val df2 = df.withColumn("B", coalesce((0 to 6).map(i=>lag(df.col("B"),i,0).over(spec)): _*))

df2.show()
Output:

+-------------------+---+---+
|               Date|  B| id|
+-------------------+---+---+
|2015-06-01 00:00:00| 33|  1|
|2015-06-02 00:00:00| 33|  1|
|2015-06-03 00:00:00| 33|  1|
|2015-06-04 00:00:00| 33|  1|
|2015-06-05 00:00:00| 22|  1|
|2015-06-06 00:00:00| 22|  1|
|2015-06-07 00:00:00| 22|  1|
+-------------------+---+---+
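As a side note (not part of the original post), the coalesce over lags 0 to 6 can only bridge gaps of up to six consecutive missing rows. A more general window-based forward fill uses last with ignoreNulls over an unbounded-preceding frame; a minimal sketch, assuming the same df with the constant id column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Carry the last non-null B forward, however long the run of nulls is.
val specAll = Window.partitionBy("id").orderBy("Date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val dfFilled = df.withColumn("B", last($"B", true).over(specAll))  // true = ignoreNulls
dfFilled.show()

This still ends up in a single partition, since the window is keyed on the constant id column, which is exactly the problem described next.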
But the problem is that everything is computed in a single partition, so I am not really taking advantage of Spark.
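One way to confirm this (a small check, not part of the original post) is to count rows per physical partition with spark_partition_id:

import org.apache.spark.sql.functions.spark_partition_id

// All rows of df2 report the same partition id, because the window is
// partitioned on the constant "id" column and therefore shuffles every row
// into one partition.
df2.groupBy(spark_partition_id().alias("partition")).count().show()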

So I tried the following code instead:

def notMissing(row: Row): Boolean = { !row.isNullAt(1) }

val toCarry: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows
  .mapPartitionsWithIndex{ case (i, iter) =>
    Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
  .collectAsMap

val toCarryBd = sc.broadcast(toCarry)

def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = {
  if (iter.contains(null)) iter.map(row => Row(toCarryBd.value(i).get(1))) else iter
}

val imputed: RDD[Row] = rows
  .mapPartitionsWithIndex{ case (i, iter) => fill(i, iter) }

val df2 = spark.createDataFrame(imputed, schema).toDF()

df2.show()
But the output is disappointing:

+----+---+
|Date|  B|
+----+---+
+----+---+
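One likely reason for the empty result (not stated in the post): Iterator.contains consumes the iterator while it searches, and a Row object itself is never null (only its fields can be), so the scan exhausts iter before fill returns it. A tiny standalone check illustrates the effect:

import org.apache.spark.sql.Row

// contains() walks the whole iterator looking for a null element; it never
// finds one, returns false, and leaves the iterator empty afterwards.
val it: Iterator[Row] = Iterator(Row("2015-06-02", null), Row("2015-06-03", null))
println(it.contains(null))  // false
println(it.hasNext)         // false: the iterator was consumed by contains()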

The implementation of the fill function here is wrong. Have a look at the steps mentioned in the answer to the referenced question:

def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = {
  // If it is the beginning of partition and value is missing
  // extract value to fill from toCarryBd.value
  // Remember to correct for empty / only missing partitions
  // otherwise take last not-null from the current partition
}
I have implemented these steps as follows:

def notMissing(row: Row): Boolean = { !row.isNullAt(1) }

val toCarryTemp: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows
  .mapPartitionsWithIndex{ case (i, iter) =>
    Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
  .collectAsMap
Extract the column-B value from this map and iterate over the partition, so that when the current partition contains nulls they are filled with the value carried over from the previous partition (a sketch of this step is shown after the table below). If we skip this step, we would get output like this:

+-------------------+---+
|               Date|  B|
+-------------------+---+
|2015-06-01 00:00:00| 33|
|2015-06-02 00:00:00|  0|
|2015-06-03 00:00:00|  0|
|2015-06-04 00:00:00|  0|
|2015-06-05 00:00:00| 22|
|2015-06-06 00:00:00|  0|
|2015-06-07 00:00:00|  0|
+-------------------+---+
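The answer's full implementation (the fillUtil helper mentioned in the comments below) is not shown in the post. Purely as an illustration of the step described above, here is a minimal sketch that carries the last non-null B across partition boundaries, assuming the rows, schema, notMissing and toCarryTemp definitions from earlier; fillPartition and carryBd are hypothetical names, not the answer's code:

val carryBd = sc.broadcast(toCarryTemp)

def fillPartition(i: Int, iter: Iterator[Row]): Iterator[Row] = {
  // Seed the carried value with the last non-null B of the nearest earlier
  // partition (stays None for the very first partition).
  var carried: Option[Any] = (0 until i).reverse
    .flatMap(j => carryBd.value.getOrElse(j, None))
    .headOption
    .map(_.get(1))
  iter.map { row =>
    if (notMissing(row)) { carried = Some(row.get(1)); row }  // remember the newest observation
    else Row(row.get(0), carried.orNull)                      // fill the gap with the carried value
  }
}

val filledRows = rows.mapPartitionsWithIndex { case (i, iter) => fillPartition(i, iter) }
spark.createDataFrame(filledRows, schema).show()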
Output:

+-------------------+---+---+
|               Date|  B| id|
+-------------------+---+---+
|2015-06-01 00:00:00| 33|  1|
|2015-06-02 00:00:00| 33|  1|
|2015-06-03 00:00:00| 33|  1|
|2015-06-04 00:00:00| 33|  1|
|2015-06-05 00:00:00| 22|  1|
|2015-06-06 00:00:00| 22|  1|
|2015-06-07 00:00:00| 22|  1|
+-------------------+---+---+
+-------------------+---+
|               Date|  B|
+-------------------+---+
|2015-06-01 00:00:00| 33|
|2015-06-02 00:00:00| 33|
|2015-06-03 00:00:00| 33|
|2015-06-04 00:00:00| 33|
|2015-06-05 00:00:00| 22|
|2015-06-06 00:00:00| 22|
|2015-06-07 00:00:00| 22|
+-------------------+---+

Hi Alon, since you are working with dates, could we assume the dataset is small enough to broadcast? The idea would be to broadcast at least the rows where B is defined, so that you can still work in a distributed way and use the date intervals to decide which value gets assigned to B.

In the fill() method, inside the map, how does the value of row get updated? I don't see the return value of fillUtil() being assigned to a variable, and row is not modified inside fillUtil(). Either I am missing something or there is some magic going on here. I am going to try this. Did you try it at scale?

@Alon Thanks for pointing that out; I have edited the answer. The reason it still gave the correct result is that fillUtil was never called inside the map, because the data is so small that every timestamp ends up in its own partition. To test it, just use val rows: RDD[Row] = df.orderBy($"Date").rdd.repartition(2), which will make fillUtil get called inside the map.

@BluePhantom No, I have not tried it at scale; I will when I have time, or share your findings if you test it. I will give it a try, but you should get the bounty as well.
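For reference, the test proposed in the last comment just repartitions the row RDD before the fill step (illustrative; note that repartition performs a full shuffle, so the ordering produced by orderBy is not preserved within the new partitions):

// Spread the rows over two partitions so the cross-partition fill path inside
// the map is actually exercised.
val rowsRepartitioned: RDD[Row] = df.orderBy($"Date").rdd.repartition(2)
println(rowsRepartitioned.getNumPartitions)  // 2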