Scala Spark: groupBy to find month-by-month averages within a date range
I'm looking at a drone rental dataset. I want to group by drone in Spark to show the average result ($) per drone as a function of the number of days used in each month. That is, the value in the Result column divided by the total number of days, apportioned across the number of days falling in each month between the start and end dates:
+------+------------------+------------------+--------+
| Drone| Start            | End              | Result |
+------+------------------+------------------+--------+
| DR1  | 16/06/2013 10:30 | 22/08/2013 07:00 | 2786   |
| DR1  | 20/04/2013 23:30 | 16/06/2013 10:30 | 7126   |
| DR1  | 24/01/2013 23:00 | 20/04/2013 23:30 | 2964   |
| DR2  | 01/03/2014 19:00 | 07/05/2014 18:00 | 8884   |
| DR2  | 04/09/2015 09:00 | 04/11/2015 07:00 | 7828   |
| DR2  | 04/10/2013 05:00 | 24/12/2013 07:00 | 5700   |
+------+------------------+------------------+--------+
This is difficult because it is a long-term rental business rather than single-day bookings, so a simple groupBy doesn't work for me.
Note that in the full dataset drones are rented by the minute, which makes things a bit messy.
I'd appreciate help with the right thought process for approaching a problem like this, and with what the code would look like.
How would you treat each of the months I've written out below as a separate case? (I can only key off the start date) :/
Taking the first example for each drone type, my expected output would be:
+------+--------+------+------+
|Drone | Month  | Days | Avg  |
+------+--------+------+------+
|DR1   | June   | X    | $YY  |
|DR1   | July   | X    | $YY  |
|DR1   | August | X    | $YY  |
|DR2   | March  | Y    | $ZZ  |
|DR2   | April  | Y    | $ZZ  |
|DR2   | May    | Y    | $ZZ  |
+------+--------+------+------+
Could you please check this? I used the "MMM-yy" date format so that if the start and end dates span multiple years the months are easy to tell apart. If you only need the month, you can change it to "MMM".
scala> val df_t = Seq(("DR1","16/06/2013 10:30","22/08/2013 07:00",2786),("DR1","20/04/2013 23:30","16/06/2013 10:30",7126),("DR1","24/01/2013 23:00","20/04/2013 23:30",2964),("DR2","01/03/2014 19:00","07/05/2014 18:00",8884),("DR2","04/09/2015 09:00","04/11/2015 07:00",7828),("DR2","04/10/2013 05:00","24/12/2013 07:00",5700)).toDF("drone","start","end","result")
df_t: org.apache.spark.sql.DataFrame = [drone: string, start: string ... 2 more fields]
scala> val df = df_t.withColumn("start",to_timestamp('start,"dd/MM/yyyy HH:mm")).withColumn("end",to_timestamp('end,"dd/MM/yyyy HH:mm"))
df: org.apache.spark.sql.DataFrame = [drone: string, start: timestamp ... 2 more fields]
scala> df.show(false)
+-----+-------------------+-------------------+------+
|drone|start |end |result|
+-----+-------------------+-------------------+------+
|DR1 |2013-06-16 10:30:00|2013-08-22 07:00:00|2786 |
|DR1 |2013-04-20 23:30:00|2013-06-16 10:30:00|7126 |
|DR1 |2013-01-24 23:00:00|2013-04-20 23:30:00|2964 |
|DR2 |2014-03-01 19:00:00|2014-05-07 18:00:00|8884 |
|DR2 |2015-09-04 09:00:00|2015-11-04 07:00:00|7828 |
|DR2 |2013-10-04 05:00:00|2013-12-24 07:00:00|5700 |
+-----+-------------------+-------------------+------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
def months_range(a:java.sql.Date,b:java.sql.Date):Seq[String]=
{
import java.time._
import java.time.format._
val start = a.toLocalDate
val end = b.toLocalDate
(start.toEpochDay until end.toEpochDay).map(LocalDate.ofEpochDay(_)).map(DateTimeFormatter.ofPattern("MMM-yy").format(_)).toSet.toSeq
}
// Exiting paste mode, now interpreting.
months_range: (a: java.sql.Date, b: java.sql.Date)Seq[String]
scala> val udf_months_range = udf( months_range(_:java.sql.Date,_:java.sql.Date):Seq[String] )
udf_months_range: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),Some(List(DateType, DateType)))
scala> val df2 = df.withColumn("days",datediff('end,'start)).withColumn("diff_months",udf_months_range('start,'end))
df2: org.apache.spark.sql.DataFrame = [drone: string, start: timestamp ... 4 more fields]
scala> df2.show(false)
+-----+-------------------+-------------------+------+----+--------------------------------+
|drone|start |end |result|days|diff_months |
+-----+-------------------+-------------------+------+----+--------------------------------+
|DR1 |2013-06-16 10:30:00|2013-08-22 07:00:00|2786 |67 |[Jun-13, Jul-13, Aug-13] |
|DR1 |2013-04-20 23:30:00|2013-06-16 10:30:00|7126 |57 |[Apr-13, May-13, Jun-13] |
|DR1 |2013-01-24 23:00:00|2013-04-20 23:30:00|2964 |86 |[Jan-13, Feb-13, Mar-13, Apr-13]|
|DR2 |2014-03-01 19:00:00|2014-05-07 18:00:00|8884 |67 |[Mar-14, Apr-14, May-14] |
|DR2 |2015-09-04 09:00:00|2015-11-04 07:00:00|7828 |61 |[Sep-15, Oct-15, Nov-15] |
|DR2 |2013-10-04 05:00:00|2013-12-24 07:00:00|5700 |81 |[Oct-13, Nov-13, Dec-13] |
+-----+-------------------+-------------------+------+----+--------------------------------+
scala> df2.withColumn("month",explode('diff_months)).withColumn("Avg",'result/'days).select("drone","month","days","avg").show(false)
+-----+------+----+------------------+
|drone|month |days|avg |
+-----+------+----+------------------+
|DR1 |Jun-13|67 |41.582089552238806|
|DR1 |Jul-13|67 |41.582089552238806|
|DR1 |Aug-13|67 |41.582089552238806|
|DR1 |Apr-13|57 |125.01754385964912|
|DR1 |May-13|57 |125.01754385964912|
|DR1 |Jun-13|57 |125.01754385964912|
|DR1 |Jan-13|86 |34.46511627906977 |
|DR1 |Feb-13|86 |34.46511627906977 |
|DR1 |Mar-13|86 |34.46511627906977 |
|DR1 |Apr-13|86 |34.46511627906977 |
|DR2 |Mar-14|67 |132.59701492537314|
|DR2 |Apr-14|67 |132.59701492537314|
|DR2 |May-14|67 |132.59701492537314|
|DR2 |Sep-15|61 |128.327868852459 |
|DR2 |Oct-15|61 |128.327868852459 |
|DR2 |Nov-15|61 |128.327868852459 |
|DR2 |Oct-13|81 |70.37037037037037 |
|DR2 |Nov-13|81 |70.37037037037037 |
|DR2 |Dec-13|81 |70.37037037037037 |
+-----+------+----+------------------+
scala>
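The UDF body is plain Scala, so it can be sanity-checked outside Spark. Below is a minimal standalone sketch of the same logic for the first DR1 rental. Note two assumptions: month names from "MMM" depend on the JVM's default locale (English assumed here), and `.distinct` is used instead of the original `.toSet.toSeq`, which keeps the months in chronological order (a `Set` does not guarantee ordering).

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Standalone version of the months_range UDF body: list the distinct
// "MMM-yy" labels covered by [start, end), end-exclusive like the UDF.
def monthsRange(start: LocalDate, end: LocalDate): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern("MMM-yy")
  (start.toEpochDay until end.toEpochDay)
    .map(LocalDate.ofEpochDay(_))
    .map(d => fmt.format(d))
    .distinct
}

// First DR1 rental: 16/06/2013 -> 22/08/2013 spans three months.
val months = monthsRange(LocalDate.of(2013, 6, 16), LocalDate.of(2013, 8, 22))
println(months)
```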
EDIT1
Splitting based on the number of days in each month; the UDF code had to change:
scala> :paste
// Entering paste mode (ctrl-D to finish)
def months_range(a:java.sql.Date,b:java.sql.Date)=
{
import java.time._
import java.time.format._
val start = a.toLocalDate
val end = b.toLocalDate
(start.toEpochDay until end.toEpochDay).map(LocalDate.ofEpochDay(_)).map(DateTimeFormatter.ofPattern("MMM-yy").format(_)).groupBy(identity).map( x => (x._1,x._2.length) )
}
// Exiting paste mode, now interpreting.
months_range: (a: java.sql.Date, b: java.sql.Date)scala.collection.immutable.Map[String,Int]
scala> val udf_months_range = udf( months_range(_:java.sql.Date,_:java.sql.Date):Map[String,Int] )
udf_months_range: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,MapType(StringType,IntegerType,false),Some(List(DateType, DateType)))
scala> val df2 = df.withColumn("days",datediff('end,'start)).withColumn("diff_months",udf_months_range('start,'end))
df2: org.apache.spark.sql.DataFrame = [drone: string, start: timestamp ... 4 more fields]
scala> val df3=df2.select(col("*"),explode('diff_months).as(Seq("month","month_days")) ).withColumn("mnth_rent",'result*('month_days/'days)).select("drone","month","month_days","days","mnth_rent")
df3: org.apache.spark.sql.DataFrame = [drone: string, month: string ... 3 more fields]
scala> df3.show(false)
+-----+------+----------+----+------------------+
|drone|month |month_days|days|mnth_rent |
+-----+------+----------+----+------------------+
|DR1 |Aug-13|21 |67 |873.223880597015 |
|DR1 |Jul-13|31 |67 |1289.044776119403 |
|DR1 |Jun-13|15 |67 |623.7313432835821 |
|DR1 |May-13|31 |57 |3875.543859649123 |
|DR1 |Apr-13|11 |57 |1375.1929824561403|
|DR1 |Jun-13|15 |57 |1875.2631578947367|
|DR1 |Apr-13|19 |86 |654.8372093023256 |
|DR1 |Feb-13|28 |86 |965.0232558139536 |
|DR1 |Mar-13|31 |86 |1068.4186046511627|
|DR1 |Jan-13|8 |86 |275.72093023255815|
|DR2 |Apr-14|30 |67 |3977.910447761194 |
|DR2 |Mar-14|31 |67 |4110.507462686567 |
|DR2 |May-14|6 |67 |795.5820895522388 |
|DR2 |Nov-15|3 |61 |384.983606557377 |
|DR2 |Oct-15|31 |61 |3978.1639344262294|
|DR2 |Sep-15|27 |61 |3464.8524590163934|
|DR2 |Nov-13|30 |81 |2111.111111111111 |
|DR2 |Oct-13|28 |81 |1970.3703703703702|
|DR2 |Dec-13|23 |81 |1618.5185185185185|
+-----+------+----------+----+------------------+
scala> df3.groupBy('drone,'month).agg(sum('month_days).as("s_month_days"),sum('mnth_rent).as("mnth_rent"),max('days).as("days")).orderBy('drone,'month).show(false)
+-----+------+------------+------------------+----+
|drone|month |s_month_days|mnth_rent |days|
+-----+------+------------+------------------+----+
|DR1 |Apr-13|30 |2030.030191758466 |86 |
|DR1 |Aug-13|21 |873.223880597015 |67 |
|DR1 |Feb-13|28 |965.0232558139536 |86 |
|DR1 |Jan-13|8 |275.72093023255815|86 |
|DR1 |Jul-13|31 |1289.044776119403 |67 |
|DR1 |Jun-13|30 |2498.994501178319 |67 |
|DR1 |Mar-13|31 |1068.4186046511627|86 |
|DR1 |May-13|31 |3875.543859649123 |57 |
|DR2 |Apr-14|30 |3977.910447761194 |67 |
|DR2 |Dec-13|23 |1618.5185185185185|81 |
|DR2 |Mar-14|31 |4110.507462686567 |67 |
|DR2 |May-14|6 |795.5820895522388 |67 |
|DR2 |Nov-13|30 |2111.111111111111 |81 |
|DR2 |Nov-15|3 |384.983606557377 |61 |
|DR2 |Oct-13|28 |1970.3703703703702|81 |
|DR2 |Oct-15|31 |3978.1639344262294|61 |
|DR2 |Sep-15|27 |3464.8524590163934|61 |
+-----+------+------------+------------------+----+
scala>
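A quick way to validate the split: since a rental's `month_days` values sum to its `days`, the apportioned monthly rents must add back to the original `result`. A minimal standalone check for the first DR1 rental, with the month-day counts (15, 31, 21 for Jun/Jul/Aug 2013) taken from the `df3` output above:

```scala
// Invariant: sum over months of result * (month_days / days) == result.
val result = 2786.0
val days = 67.0
val monthDays = Seq(15, 31, 21) // Jun-13, Jul-13, Aug-13
val perMonth = monthDays.map(d => result * d / days)
println(perMonth)     // matches the mnth_rent values shown for DR1 above
println(perMonth.sum) // adds back to 2786.0 (up to floating point)
```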