How to add months to a date in Scala, where the number of months to add comes from the column name


I want to explode columns in Spark Scala:

reference_month          M            M+1          M+2
2020-01-01               10           12           10
2020-02-01               10           12           10
The output should be:

reference_month          Month    reference_date_id
2020-01-01               10       2020-01
2020-01-01               12       2020-02
2020-01-01               10       2020-03
2020-02-01               10       2020-02
2020-02-01               12       2020-03
2020-02-01               10       2020-04
where reference_date_id = reference_month + x months (x being the offset implied by the column name M, M+1, M+2).


Is there any way to get output in this format in Spark Scala?

We can use the M, M+1, and M+2 columns to create an array, then explode the array to get the desired dataframe.

Example:

df.selectExpr("reference_month","array(M,`M+1`,`M+2`)as arr").
selectExpr("reference_month","explode(arr) as Month").show()

+---------------+-----+
|reference_month|Month|
+---------------+-----+
|         202001|   10|
|         202001|   12|
|         202001|   10|
|         202002|   10|
|         202002|   12|
|         202002|   10|
+---------------+-----+
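This gives reference_month and Month but not the reference_date_id the question asks for. A minimal sketch of one way to add it (my own extension, not part of this answer): posexplode also emits the element position, which is exactly the month offset, assuming reference_month is a date or a yyyy-MM-dd string.

df.selectExpr("reference_month", "posexplode(array(M, `M+1`, `M+2`)) as (pos, Month)")
  // pos is 0, 1, 2 for M, M+1, M+2; add it as months, then keep only yyyy-MM
  .selectExpr("reference_month", "Month",
    "date_format(add_months(reference_month, pos), 'yyyy-MM') as reference_date_id")
  .show()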


You can use Apache Spark's unpivot technique:

import org.apache.spark.sql.functions.expr

data.select($"reference_month", expr("stack(3, `M`, `M+1`, `M+2`) as (Month)")).show()
Comments from the thread:

- Are the M, M+1 and M+2 elements array columns, or separate columns? Can you post the original dataframe?
- Can you share what you have tried so far?
- Can you help me generate reference_date_id as per the updated question?
- Can you do this in Scala? You only have to change the import statements; the rest stays the same.
- reference_month is in date format. Could this be simpler? I mean 2020-02-01 + x (x being the month from M, M+1, ...)?
- That is a mistake; I tried to improve it, but no luck.
You can use the **stack** function:
import sys
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import when, concat_ws, lpad, row_number, col, expr, substring, length
from pyspark.sql.window import Window

schema = StructType([
    StructField("reference_month", StringType(), True),
    StructField("M", IntegerType(), True),
    StructField("M+1", IntegerType(), True),
    StructField("M+2", IntegerType(), True)
])
mnt = [("2020-01-01", 10, 12, 10), ("2020-02-01", 10, 12, 10)]
df = spark.createDataFrame(mnt, schema)

# cast the string reference_month to a proper date column
newdf = df.withColumn("t", col("reference_month").cast("date")).drop("reference_month").withColumnRenamed("t", "reference_month")

# unpivot M, M+1, M+2 into one row per value
exp = expr("stack(3, `M`, `M+1`, `M+2`) as (Values)")
t = newdf.select("reference_month", exp).withColumn("mnth", substring("reference_month", 6, 2)).withColumn("newmnth", col("mnth").cast("Integer")).drop("mnth")

# number the three rows of each reference_month as offsets 0, 1, 2
windowval = Window.partitionBy("reference_month").orderBy("reference_month").rowsBetween(-sys.maxsize, 0)
ref_cal = t.withColumn("reference_date_id", row_number().over(windowval) - 1)

# build yyyy-MM: add the offset to the month number and left-pad to two digits
ref_cal.withColumn("new_dt", concat_ws("-", substring("reference_month", 1, 4),
        when(length(col("reference_date_id") + col("newmnth")) < 2,
             lpad(col("reference_date_id") + col("newmnth"), 2, "0"))
        .otherwise(col("reference_date_id") + col("newmnth")))) \
    .drop("newmnth", "reference_date_id") \
    .withColumnRenamed("new_dt", "reference_date_id") \
    .orderBy("reference_month").show()

+---------------+------+-----------------+
|reference_month|Values|reference_date_id|
+---------------+------+-----------------+
|     2020-01-01|    10|          2020-01|
|     2020-01-01|    12|          2020-02|
|     2020-01-01|    10|          2020-03|
|     2020-02-01|    10|          2020-02|
|     2020-02-01|    12|          2020-03|
|     2020-02-01|    10|          2020-04|
+---------------+------+-----------------+
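A caveat of my own, not from the original answer: new_dt is built by string concatenation, so the month arithmetic never rolls over into the next year; a reference_month of 2020-11-01 would produce 2020-13 for the M+2 row. The window also orders by reference_month inside a partition that holds a single reference_month value, so the 0/1/2 offsets rely on Spark preserving the stack output order. Both issues disappear with calendar-aware arithmetic, as in the stack-with-explicit-offsets sketch after the second answer.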