How do I create bins of date ranges in Spark Scala?

|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
|                       3|                    null|
|                    null|                       2|




// First we set up the problem

// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")

// Get the current local date
val now = java.time.LocalDate.now

// Create a range of 1-10000 and map each to minusDays 
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))

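// Create a DataFrame we can work with.
// toDF needs the session implicits; spark-shell imports them automatically,
// elsewhere import them from your own SparkSession value (here named spark).
import spark.implicits._
val df = dates.toDF("date")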
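So far so good. We have date entries to work with, and they follow your format (MM-dd-yyyy). Next, we need a function that returns 1 if a date falls within a range and 0 if it does not. From this function we create a UserDefinedFunction (UDF), so that Spark can apply it to all rows in parallel on the executors.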

// We will process each range one at a time, so we'll take it as a string 
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat

def isWithinRange(date: String, binRange: String): Int = {
  val format = new SimpleDateFormat("MM-dd-yyyy")
  val startDate = format.parse(binRange.split(" - ").head)
  val endDate = format.parse(binRange.split(" - ").last)
  val testDate = format.parse(date)

  // Inclusive on both ends: neither before the start nor after the end
  if (!(testDate.before(startDate) || testDate.after(endDate))) 1
  else 0
}

// We create a udf which uses an anonymous function taking two args and 
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def isWithinRangeUdf: UserDefinedFunction =
  udf((date: String, binRange: String) => isWithinRange(date, binRange))
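Since you asked about handling MM-dd-yyyy dates specifically, here is a minimal sketch of the same range check written with java.time instead of SimpleDateFormat. The name isWithinRangeJavaTime is hypothetical (it is not part of the code above), and it assumes the same "start - end" bin string layout. The formatter is built inside the function, mirroring the original, so nothing extra is captured when the function is wrapped in a UDF.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical java.time variant of isWithinRange:
// returns 1 if `date` lies inside the inclusive bin range, otherwise 0.
def isWithinRangeJavaTime(date: String, binRange: String): Int = {
  // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat
  val format = DateTimeFormatter.ofPattern("MM-dd-yyyy")

  // A bin looks like "01-01-1990 - 12-31-1999"
  val parts = binRange.split(" - ")
  val startDate = LocalDate.parse(parts.head, format)
  val endDate = LocalDate.parse(parts.last, format)
  val testDate = LocalDate.parse(date, format)

  if (!testDate.isBefore(startDate) && !testDate.isAfter(endDate)) 1
  else 0
}
```

It should be usable as a drop-in replacement inside the udf(...) call, for example udf((date: String, binRange: String) => isWithinRangeJavaTime(date, binRange)).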


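Adding up the three bin counts in the final output gives 2450 + 3653 + 3897 = 10000, so our work seems to be correct. Maybe I overdid it and there is a simpler solution; please let me know if you know a better way, especially for handling MM-dd-yyyy dates.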

// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
                "01-01-2000 - 12-31-2009",
                "01-01-2010 - 12-31-2020")

// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}

val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
  changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}

withBinsDf.show(1)

//|      date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//|09-01-2020|                      0|                      0|                      1|
//only showing top 1 row
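
// Keep only the bin columns, then sum each one to get the count per bin
import org.apache.spark.sql.functions.sum

val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)

summedBinsDf.show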

//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//|                   2450|                   3653|                   3897|
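
The totals above can be cross-checked without Spark. The sketch below is only a sanity check on plain Scala collections, not part of the Spark job. It assumes a run date of 2020-09-02, which is inferred from the sample row 09-01-2020 (the most recent generated date is the run date minus one day). The names assumedRunDate, checkDates, checkBins and binCounts are made up for this illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val checkFormat = DateTimeFormatter.ofPattern("MM-dd-yyyy")

// Assumption: run date inferred from the sample row 09-01-2020 (run date minus 1 day)
val assumedRunDate = LocalDate.of(2020, 9, 2)

// The same 10000 dates as in the answer, kept as LocalDate values
val checkDates = (1 to 10000).map(n => assumedRunDate.minusDays(n.toLong))

val checkBins = List("01-01-1990 - 12-31-1999",
                     "01-01-2000 - 12-31-2009",
                     "01-01-2010 - 12-31-2020")

// Count how many dates fall inside each inclusive bin
val binCounts: List[(String, Int)] = checkBins.map { bin =>
  val parts = bin.split(" - ")
  val start = LocalDate.parse(parts.head, checkFormat)
  val end = LocalDate.parse(parts.last, checkFormat)
  bin -> checkDates.count(d => !d.isBefore(start) && !d.isAfter(end))
}
```

With that run date, binCounts holds 2450, 3653 and 3897, matching the Spark output. Note that the numbers depend on when the code runs: with a later run date, any generated date after 12-31-2020 falls outside all three bins, so the bin totals would no longer add up to 10000.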