Apache Spark SQL: assign a sequence number to each sub-group
In the table below, users are watching TV channels. In Spark SQL, I want to create a column named "CHANGE_SEQUENCE" with the following logic:
CASE WHEN ROW_NUMBER = 1
     THEN 1
     ELSE
       CASE WHEN CHANNEL_CHANGED = true
            THEN (lag(CHANGE_SEQUENCE, 1, CHANGE_SEQUENCE) OVER (PARTITION BY CUSTOMER_ID ORDER BY ROW_NUMBER)) + 1
            ELSE (lag(CHANGE_SEQUENCE, 1, CHANGE_SEQUENCE) OVER (PARTITION BY CUSTOMER_ID ORDER BY ROW_NUMBER))
       END
END
AS CHANGE_SEQUENCE
Input
CUSTOMER_ID | TV_CHANNEL_ID | PREV_CHANNEL_ID | ROW_NUMBER | CHANNEL_CHANGED
1 | 100 | NULL | 1 | FALSE
1 | 100 | 100 | 2 | FALSE
1 | 100 | 100 | 3 | FALSE
1 | 200 | 100 | 4 | TRUE
1 | 200 | 200 | 5 | FALSE
1 | 200 | 200 | 6 | FALSE
1 | 300 | 200 | 7 | TRUE
1 | 300 | 300 | 8 | FALSE
1 | 300 | 300 | 9 | FALSE
Output
CUSTOMER_ID | TV_CHANNEL_ID | PREV_CHANNEL_ID | ROW_NUMBER | CHANNEL_CHANGED | CHANGE_SEQUENCE
1 | 100 | NULL | 1 | FALSE | 1
1 | 100 | 100 | 2 | FALSE | 1
1 | 100 | 100 | 3 | FALSE | 1
1 | 200 | 100 | 4 | TRUE | 2
1 | 200 | 200 | 5 | FALSE | 2
1 | 200 | 200 | 6 | FALSE | 2
1 | 300 | 200 | 7 | TRUE | 3
1 | 300 | 300 | 8 | FALSE | 3
1 | 300 | 300 | 9 | FALSE | 3
For the first three records the customer watched channel 100, so those form one group; likewise for channels 200 and 300.
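The desired CHANGE_SEQUENCE is simply a running count of the rows that start a new viewing block (the first row, and every row where CHANNEL_CHANGED is true). A minimal pure-Python sketch of that logic on the sample data (the function name is illustrative, not Spark API):

```python
def change_sequence(rows):
    """rows: (row_number, channel_changed) tuples for one customer,
    ordered by row_number. Returns the CHANGE_SEQUENCE value per row:
    the counter increments whenever a new viewing block starts."""
    seq, out = 0, []
    for row_number, changed in rows:
        if row_number == 1 or changed:
            seq += 1
        out.append(seq)
    return out

rows = [(1, False), (2, False), (3, False),
        (4, True),  (5, False), (6, False),
        (7, True),  (8, False), (9, False)]
print(change_sequence(rows))  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```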
Please advise.

In PySpark:
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('s').getOrCreate()
df = spark.read.csv('...', header=True)

# Order by the numeric value of ROW_NUMBER; CSV columns are read as strings.
w = Window.partitionBy('CUSTOMER_ID').orderBy(f.col('ROW_NUMBER').cast('int'))

# Rows that start a new viewing block (the first row, or a channel change)
# get a fresh sequence number.
df_p1 = df.filter((df['ROW_NUMBER'] == '1') | (df['CHANNEL_CHANGED'] == 'TRUE')) \
    .withColumn('CHANGE_SEQUENCE', f.row_number().over(w))

# All other rows start out NULL and inherit a number via forward fill below.
df_p2 = df.filter((df['ROW_NUMBER'] != '1') & (df['CHANNEL_CHANGED'] == 'FALSE')) \
    .withColumn('CHANGE_SEQUENCE', f.lit(None))

# Forward-fill: last() with ignorenulls=True carries the most recent non-null
# sequence number down the ordered partition. (A collect_set().pop() trick is
# unreliable here because collect_set does not guarantee element order.)
df = df_p1.union(df_p2) \
    .withColumn('CHANGE_SEQUENCE', f.last('CHANGE_SEQUENCE', ignorenulls=True).over(w))
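The forward-fill step that the union relies on can be illustrated without Spark. This sketch mimics what `last(col, ignorenulls=True)` over an ordered window does within one partition (the helper name is illustrative):

```python
def forward_fill(values):
    """Replace each None with the most recent preceding non-None value,
    mimicking Spark's last(col, ignorenulls=True) over an ordered window."""
    out, prev = [], None
    for v in values:
        if v is not None:
            prev = v
        out.append(prev)
    return out

# Only the sequence-starting rows (1, 4, 7) carry a number after the union;
# the fill propagates it to the rest of each block.
print(forward_fill([1, None, None, 2, None, None, 3, None, None]))
# [1, 1, 1, 2, 2, 2, 3, 3, 3]
```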