How to operate on two DataFrames in Spark Scala


I have two tables:
1. stocks: the number of bikes I have on a given day
2. sales: the number of bikes sold on the same day last year

I want to predict, based on last year's data, in how many days I will be able to sell my stock.
For example, if I have 80 KTM bikes on 2018-07-26, I can sell them in 3 days. How can I do that?

Ignore 2017-07-25 as it is more than one year old
on 2017-07-26 (80-15=65)
on 2017-07-27 (65-40=25)
on 2017-07-28 (25-50=-25)

so I can sell them in 3 days
For the RE bikes, I could sell them in 1 day.

stocks table
+----+----------+-----+
|Bike|      Date|Units|
+----+----------+-----+
| KTM|2018-07-26|   80|
|  RE|2018-07-26|   40|
+----+----------+-----+
Second table:

sales table
+----+----------+-----------+
|Bike|      Date|Saled_units|
+----+----------+-----------+
| KTM|2017-07-25|         10|
| KTM|2017-07-26|         15|
| KTM|2017-07-27|         40|
| KTM|2017-07-28|         50|
| KTM|2017-07-29|         30|
|  RE|2017-07-26|         50|
+----+----------+-----------+

How can I do this using Spark SQL?

You can get the desired output by following the steps below.

1. Create the DataFrames and rename the stocks column "Bike" to "Bike_Temp".

2. Group the sales DataFrame by Bike and use an aggregate function to collect the remaining columns into an array.

3. Finally, join the grouped_sales and stocks_temp DataFrames on the Bike column.

 // Sample data (I modified the dates for testing; see the note at the end)
 val sales=Seq(("KTM","2017-07-26",10),("KTM","2017-07-27",15),("KTM","2017-07-28",40),("KTM","2017-07-29",50),("KTM","2017-07-30",30),("RE","2017-07-27",50)).toDF("Bike","Date","Saled_units")
 val stocks=Seq(("KTM","2018-07-27",80),("RE","2018-07-27",40)).toDF("Bike","Date","Units")
 // Step 1: rename the stocks join column to avoid ambiguity after the join
 val stocks_temp=stocks.withColumnRenamed("Bike","Bike_Temp")
 // Step 2: collect each bike's (Date, Saled_units) pairs into an array of structs
 val grouped_sales=sales.groupBy("Bike").agg(collect_list(struct("Date","Saled_units")).as("date_saled_units"))
 // Step 3: join the grouped sales with the stocks on the Bike column
 val joineddf=grouped_sales.join(stocks_temp,grouped_sales("Bike")===stocks_temp("Bike_Temp"))
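
To verify the intermediate result at this point, you can print the schema and contents of the joined DataFrame:

 joineddf.printSchema()
 joineddf.show(false)
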
4. Create a UDF, register it, and write the day-counting logic inside the UDF.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.Calendar
import spark.implicits._

 def getNoOfDays(arr:Seq[Row],stock_units:Int):Int={
        val dateFormat=new SimpleDateFormat("yyyy-MM-dd")
        val cal=Calendar.getInstance()
        // Setting the date to exactly one year before today (time part dropped)
        cal.add(Calendar.YEAR, -1)
        val previousYearTodaysDate=dateFormat.parse(dateFormat.format(cal.getTime()))
        // Converting the collected array<struct<Date,Saled_units>> to Seq[(Date, Int)];
        // this assumes the collected list is already in date order
        val dateBikeArr=arr.map(row=>(dateFormat.parse(row.getAs("Date").toString),row.getAs("Saled_units").toString.toInt))
        var noOfDays=0
        var totalUnits=0
        // Walking over the dates on or after last year's date, summing the units
        // sold until the stock is covered
        for(tup<-dateBikeArr){
          if(tup._1.compareTo(previousYearTodaysDate)>=0){
            totalUnits=totalUnits+tup._2
            noOfDays=noOfDays+1
            if(totalUnits>=stock_units)
              return noOfDays
          }
        }
        noOfDays
    }

 // Register the function as a UDF so it can be applied to DataFrame columns
 val getNoOfDaysUDF=udf(getNoOfDays _)
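5. Apply the UDF to the joined DataFrame to get the number of days for each bike. A minimal sketch of the call (the column name days_required is chosen to match the sample output below):

 val result=joineddf.withColumn("days_required",getNoOfDaysUDF(col("date_saled_units"),col("Units")))
 result.select("Bike","days_required").show()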
6. Sample output:

+----+-------------+
|Bike|days_required|
+----+-------------+
|  RE|            1|
| KTM|            3|
+----+-------------+
I modified the dates for testing purposes, and I assumed you will always check against the current date. Hope this helps.
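
Since the question also asks about Spark SQL: below is a minimal sketch of the same logic using a running-sum window instead of a UDF. The view names and the 12-month cutoff are illustrative assumptions, and note that, unlike the UDF, a bike whose stock is never covered by last year's sales would be omitted from this result:

 stocks.createOrReplaceTempView("stocks")
 sales.createOrReplaceTempView("sales")
 spark.sql("""
   SELECT s.Bike, MIN(t.rn) AS days_required
   FROM stocks s
   JOIN (
     SELECT Bike,
            SUM(Saled_units) OVER (PARTITION BY Bike ORDER BY Date) AS running_total,
            ROW_NUMBER() OVER (PARTITION BY Bike ORDER BY Date) AS rn
     FROM sales
     WHERE to_date(Date) >= add_months(current_date(), -12)
   ) t ON s.Bike = t.Bike
   WHERE t.running_total >= s.Units
   GROUP BY s.Bike
 """).show()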
