How do I create bins of date ranges in Spark Scala?


Hi, how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date-range bins and count how many entries fall into each bin of a histogram.

My input DataFrame looks like this:

In Python, my bin edges look like this:

bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
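
For reference, the same bin edges written as a Scala List of strings (the form the answers below work with) would be:

val bins = List("01-01-1990 - 12-31-1999",
                "01-01-2000 - 12-31-2009")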

The output DataFrame I'm looking for is, for each bin, a count of how many values from the original DataFrame fall into it:


Can anyone guide me on how to do this with Spark Scala? I'm a bit lost. Thank you.

Are you looking for a result like the following:

+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
|                       3|                    null|
|                    null|                       2|
+------------------------+------------------------+
It can be achieved with a little Spark SQL and the pivot function, as shown below.

Note the left join condition.


That said, since you have 2 bin ranges, 2 rows will be generated.
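
The answer's original code does not appear above, so here is a minimal sketch of what a left-join-plus-pivot version along these lines might look like. The events sample data and the names binRangesDf, event_date, bin_label, bin_start and bin_end are invented for illustration, and spark.implicits._ is assumed to be in scope (as it is in spark-shell):

import org.apache.spark.sql.functions.{col, count, to_date}

// One row per bin, with the label plus parsed start and end dates
val binRangesDf = Seq(
  ("01-01-1990 - 12-31-1999", "01-01-1990", "12-31-1999"),
  ("01-01-2000 - 12-31-2009", "01-01-2000", "12-31-2009")
).toDF("bin_label", "bin_start", "bin_end")
  .withColumn("bin_start", to_date(col("bin_start"), "MM-dd-yyyy"))
  .withColumn("bin_end", to_date(col("bin_end"), "MM-dd-yyyy"))

// Hypothetical sample input standing in for the question's DataFrame (not shown above)
val events = Seq(
  "05-12-1991", "07-04-1995", "11-30-1999",  // three dates in the first bin
  "01-20-2001", "02-01-2005"                 // two dates in the second bin
).toDF("event_date")

// Left join so that every bin is kept even if no dates fall into it
val joined = binRangesDf.join(
  events,
  to_date(col("event_date"), "MM-dd-yyyy").between(col("bin_start"), col("bin_end")),
  "left"
)

// Pivot on the bin label so each bin becomes its own column; the counts land on
// separate rows, giving the two-row shape shown above
joined
  .groupBy(col("bin_label"))
  .pivot("bin_label")
  .agg(count(col("event_date")))
  .drop("bin_label")
  .show()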

We can do this by looking at the date column and determining which range each record falls into.

// First we set up the problem

// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")

// Get the current local date
val now = java.time.LocalDate.now

// Create a range of 1 to 10000 and map each value through minusDays
// so we have a range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))

// Create a DataFrame we can work with
// (toDF on a local collection needs spark.implicits._ in scope; spark-shell imports it automatically)
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they resemble your MM-dd-yyyy format. Next, we need a function that returns 1 if a date falls within a range and 0 if it does not. We turn that function into a UserDefinedFunction (UDF) so we can apply it to all rows at once on the Spark executors.

// We will process each range one at a time, so we'll take it as a string 
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat

def isWithinRange(date: String, binRange: String): Int = {
  val format = new SimpleDateFormat("MM-dd-yyyy")
  val startDate = format.parse(binRange.split(" - ").head)
  val endDate = format.parse(binRange.split(" - ").last)
  val testDate = format.parse(date)

  if (!(testDate.before(startDate) || testDate.after(endDate))) 1
  else 0
}

// We create a udf which uses an anonymous function taking two args and 
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def isWithinRangeUdf: UserDefinedFunction =
  udf((date: String, binRange: String) => isWithinRange(date, binRange))
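
As a quick sanity check (not something the answer itself does), the helper and the UDF can be exercised against the df built earlier:

import org.apache.spark.sql.functions.{col, lit}

// Plain Scala call on the driver: 1 means the date is inside the range, 0 means outside
isWithinRange("06-15-1995", "01-01-1990 - 12-31-1999")  // res: Int = 1
isWithinRange("06-15-2005", "01-01-1990 - 12-31-1999")  // res: Int = 0

// The same check applied to every row of the DataFrame through the UDF
df.withColumn("in_90s", isWithinRangeUdf(col("date"), lit("01-01-1990 - 12-31-1999")))
  .show(3)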
Now that we have the UDF set up, we add a new column to the DataFrame for each of the given bins and then sum those columns, which is why we made the function return an Int.

// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
                "01-01-2000 - 12-31-2009",
                "01-01-2010 - 12-31-2020")

// We fold over the bins list, adding one column per bin,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit, sum}

val withBinsDf = bins.foldLeft(df){ (changingDf, bin) =>
  changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}

withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//|      date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020|                      0|                      0|                      1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row

Finally, we select the bin columns, group, and sum them:

val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep the column names as-is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)

summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//|                   2450|                   3653|                   3897|
//+-----------------------+-----------------------+-----------------------+

2450 + 3653 + 3897 = 10000, so the result appears to be correct. I may well have over-engineered this and a simpler solution may exist; please let me know if you know a better way, especially for handling MM-dd-yyyy dates.

Why does 01-01-2000 - 12-31-2009 show 2? Shouldn't it be 4 + 3 = 7, i.e. the count for 01-20-2001 plus the count for 02-01-2005? And what is the 0 in the result? Never mind, I see it now: you are counting the "entries" that fall within each bin range.
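
Regarding a possibly simpler approach: a UDF-free variant that reuses the df and bins defined above and relies only on Spark's built-in date functions might look like the sketch below (not from the original answers):

import org.apache.spark.sql.functions.{col, lit, sum, to_date, when}

// Parse the MM-dd-yyyy strings into a proper date column once
val withParsedDf = df.withColumn("parsed_date", to_date(col("date"), "MM-dd-yyyy"))

// Build one conditional sum per bin, keeping the bin label as the column name
val binSums = bins.map { bin =>
  val Array(start, end) = bin.split(" - ")
  sum(
    when(col("parsed_date").between(to_date(lit(start), "MM-dd-yyyy"),
                                    to_date(lit(end), "MM-dd-yyyy")), 1)
      .otherwise(0)
  ).as(bin)
}

withParsedDf.agg(binSums.head, binSums.tail: _*).show()

This skips the round trip through a UDF and expresses the range test directly as column expressions, so Catalyst can see and optimize the date comparison.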