How to add months to a date in Scala, where the number of months to add comes from the column name


I want to explode columns in Spark Scala:

reference_month          M            M+1          M+2
2020-01-01               10           12           10
2020-02-01               10           12           10
The output should be:

reference_month          Month    reference_date_id
2020-01-01               10       2020-01
2020-01-01               12       2020-02
2020-01-01               10       2020-03
2020-02-01               10       2020-02
2020-02-01               12       2020-03
2020-02-01               10       2020-04
where reference_date_id = reference_month + x months (x being the offset implied by the column name M, M+1, M+2).


Is there any way to get output in this format in Spark Scala?

We can use the M, M+1, and M+2 columns to create an array, then explode the array to get the desired dataframe.

Example:

df.selectExpr("reference_month","array(M,`M+1`,`M+2`)as arr").
selectExpr("reference_month","explode(arr) as Month").show()

+---------------+-----+
|reference_month|Month|
+---------------+-----+
|         202001|   10|
|         202001|   12|
|         202001|   10|
|         202002|   10|
|         202002|   12|
|         202002|   10|
+---------------+-----+
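This gives reference_month and Month but not the reference_date_id the question asks for. A minimal sketch of one way to add it (my own extension, not part of this answer): posexplode also emits the element position, which is exactly the month offset, assuming reference_month is a date or a yyyy-MM-dd string.

df.selectExpr("reference_month", "posexplode(array(M, `M+1`, `M+2`)) as (pos, Month)")
  // pos is 0, 1, 2 for M, M+1, M+2; add it as months, then keep only yyyy-MM
  .selectExpr("reference_month", "Month",
    "date_format(add_months(reference_month, pos), 'yyyy-MM') as reference_date_id")
  .show()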


You can use Apache Spark's unpivot technique:

import org.apache.spark.sql.functions.expr

data.select($"reference_month", expr("stack(3, `M`, `M+1`, `M+2`) as (Month)")).show()
Comments from the thread:

- Are the M, M+1 and M+2 elements array columns, or separate columns? Can you post the original dataframe?
- Can you share what you have tried so far?
- Can you help me generate reference_date_id as per the updated question?
- Can you do this in Scala? You only have to change the import statements; the rest stays the same.
- reference_month is in date format. Could this be simpler? I mean 2020-02-01 + x (x being the month from M, M+1, ...)?
- That is a mistake; I tried to improve it, but no luck.
You can use the **stack** function:
import sys
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import when, concat_ws, lpad, row_number, col, expr, substring, length
from pyspark.sql.window import Window

schema = StructType([
    StructField("reference_month", StringType(), True),
    StructField("M", IntegerType(), True),
    StructField("M+1", IntegerType(), True),
    StructField("M+2", IntegerType(), True)
])
mnt = [("2020-01-01", 10, 12, 10), ("2020-02-01", 10, 12, 10)]
df = spark.createDataFrame(mnt, schema)

# cast the string reference_month to a proper date column
newdf = df.withColumn("t", col("reference_month").cast("date")).drop("reference_month").withColumnRenamed("t", "reference_month")

# unpivot M, M+1, M+2 into one row per value
exp = expr("stack(3, `M`, `M+1`, `M+2`) as (Values)")
t = newdf.select("reference_month", exp).withColumn("mnth", substring("reference_month", 6, 2)).withColumn("newmnth", col("mnth").cast("Integer")).drop("mnth")

# number the three rows of each reference_month as offsets 0, 1, 2
windowval = Window.partitionBy("reference_month").orderBy("reference_month").rowsBetween(-sys.maxsize, 0)
ref_cal = t.withColumn("reference_date_id", row_number().over(windowval) - 1)

# build yyyy-MM: add the offset to the month number and left-pad to two digits
ref_cal.withColumn("new_dt", concat_ws("-", substring("reference_month", 1, 4),
        when(length(col("reference_date_id") + col("newmnth")) < 2,
             lpad(col("reference_date_id") + col("newmnth"), 2, "0"))
        .otherwise(col("reference_date_id") + col("newmnth")))) \
    .drop("newmnth", "reference_date_id") \
    .withColumnRenamed("new_dt", "reference_date_id") \
    .orderBy("reference_month").show()

+---------------+------+-----------------+
|reference_month|Values|reference_date_id|
+---------------+------+-----------------+
|     2020-01-01|    10|          2020-01|
|     2020-01-01|    12|          2020-02|
|     2020-01-01|    10|          2020-03|
|     2020-02-01|    10|          2020-02|
|     2020-02-01|    12|          2020-03|
|     2020-02-01|    10|          2020-04|
+---------------+------+-----------------+
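A caveat of my own, not from the original answer: new_dt is built by string concatenation, so the month arithmetic never rolls over into the next year; a reference_month of 2020-11-01 would produce 2020-13 for the M+2 row. The window also orders by reference_month inside a partition that holds a single reference_month value, so the 0/1/2 offsets rely on Spark preserving the stack output order. Both issues disappear with calendar-aware arithmetic, as in the stack-with-explicit-offsets sketch after the second answer.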