Apache Spark / PySpark: generating rows based on a column value
Here is the input data:
+-------------------+-------------------+----+
|              start|        format_date|diff|
+-------------------+-------------------+----+
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|
+-------------------+-------------------+----+
Expected output:
start format_date diff seq
2019-11-15 20:30:00 2019-11-15 18:30:00 4 1
2019-11-15 20:30:00 2019-11-15 18:30:00 4 2
2019-11-15 20:30:00 2019-11-15 18:30:00 4 3
2019-11-15 20:30:00 2019-11-15 18:30:00 4 4
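How can I generate rows based on the value of a column (diff)?
Spark < 2.4
You can use the explode function. A UDF first builds the array [1, ..., diff] for each row, and explode then emits one row per array element: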
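import pyspark.sql.functions as F
import pyspark.sql.types as Types

def rangeArr(diff):
    # Return a list rather than a range object so Spark can serialize it as an array
    return list(range(1, diff + 1))

# UDF that builds the array [1, 2, ..., diff] for each row
rangeUdf = F.udf(rangeArr, Types.ArrayType(Types.IntegerType()))
df = df.withColumn('seqArr', rangeUdf('diff'))
# explode emits one output row per array element
df = df.withColumn('seq', F.explode('seqArr'))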
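To make the two steps concrete without a Spark session, here is a minimal plain-Python sketch of what the UDF plus explode combination does to each row. The names range_arr and explode_rows are hypothetical helpers introduced only for illustration, and the handling of a missing diff (treated as an empty array) is an assumption of this sketch, not something the answer above does.

```python
def range_arr(diff):
    """Mimic the UDF: build [1, 2, ..., diff].

    Assumption of this sketch: None (or any diff < 1) yields an empty list.
    """
    if diff is None:
        return []
    return list(range(1, diff + 1))


def explode_rows(rows):
    """Mimic explode: emit one copy of the row per array element."""
    out = []
    for row in rows:
        for seq in range_arr(row["diff"]):
            # Copy the original columns and attach the sequence number
            out.append({**row, "seq": seq})
    return out


rows = [
    {
        "start": "2019-11-15 20:30:00",
        "format_date": "2019-11-15 18:30:00",
        "diff": 4,
    }
]

expanded = explode_rows(rows)
for r in expanded:
    print(r["start"], r["format_date"], r["diff"], r["seq"])
```

With diff = 4 the single input row becomes four rows with seq running from 1 to 4, which matches the expected output in the question. A row whose array is empty produces no output rows in this sketch.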
Solution for Spark 2.4 or later
from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`
df = spark.createDataFrame([["2019-11-15 20:30:00", "2019-11-15 18:30:00", 4]], ["start", "format_date", "diff"])
# sequence(1, diff) builds the array [1, ..., diff]; explode emits one row per element
df.select("*", F.explode(F.sequence(F.lit(1), F.col("diff"))).alias("seq")).show()
+-------------------+-------------------+----+---+
| start| format_date|diff|seq|
+-------------------+-------------------+----+---+
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 1|
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 2|
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 3|
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 4|
+-------------------+-------------------+----+---+