Apache Spark: get the first non-null value in a group
In Spark SQL, how can I get the first non-null value (or the first value not matching text such as 'N/A') in a group? In the example below, a user is watching TV channels; the first records for channel 100 have a signal strength of N/A, and the next record's value is Good, so I want to use that one. I tried window functions, but I only found methods such as MAX and MIN. If I use LEAD I only get the next row, and with an unbounded frame I don't see anything like a firstNotNull method. Please advise.

Input:
CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGTH
1           || 100           || 0    || N/A
1           || 100           || 1    || Good
1           || 100           || 2    || Medium
1           || 100           || 3    || N/A
1           || 100           || 4    || Poor
1           || 100           || 5    || Medium
1           || 200           || 6    || N/A
1           || 200           || 7    || N/A
1           || 200           || 8    || Poor
1           || 300           || 9    || Good
1           || 300           || 10   || Good
1           || 300           || 11   || Good
Expected output:

CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGTH
1           || 100           || 0    || Good
1           || 100           || 1    || Good
1           || 100           || 2    || Medium
1           || 100           || 3    || Poor
1           || 100           || 4    || Poor
1           || 100           || 5    || Medium
1           || 200           || 6    || Poor
1           || 200           || 7    || Poor
1           || 200           || 8    || Poor
1           || 300           || 9    || Good
1           || 300           || 10   || Good
1           || 300           || 11   || Good
Actual code:
package com.ganesh.test;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ChannelLoader {
    private static final Logger LOGGER = LoggerFactory.getLogger(ChannelLoader.class);

    public static void main(String[] args) throws AnalysisException {
        String master = "local[*]";
        //region
        SparkSession sparkSession = SparkSession
                .builder()
                .appName(ChannelLoader.class.getName())
                .master(master).getOrCreate();
        SparkContext context = sparkSession.sparkContext();
        context.setLogLevel("ERROR");
        SQLContext sqlCtx = sparkSession.sqlContext();
        Dataset<Row> rawDataset = sparkSession.read()
                .format("com.databricks.spark.csv")
                .option("delimiter", ",")
                .option("header", "true")
                .load("sample_channel.csv");
        rawDataset.printSchema();
        rawDataset.createOrReplaceTempView("channelView");
        //endregion
        WindowSpec windowSpec = Window.partitionBy("CUSTOMER_ID").orderBy("TV_CHANNEL_ID");
        // NOTE: this query fails -- isNan is not a window/aggregate function,
        // so it cannot be used with an OVER clause
        rawDataset = sqlCtx.sql("select * ," +
                " ( isNan(SIGNAL_STRENGHT) over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) ) as updatedStren " +
                " from channelView " +
                " order by CUSTOMER_ID, TV_CHANNEL_ID, TIME "
        );
        rawDataset.show();
        sparkSession.close();
    }
}
Code output (with intermediate columns):
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
|CUSTOMER_ID|TV_CHANNEL_ID|TIME|SIGNAL_STRENGTH| fwdValues| bkwdValues|rank_fwd|rank_bkwd|SIGNAL_STRENGTH|NEW_SIGNAL_STRENGTH|
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
| 1| 100| 0| null|[Good, Meduim, Poor]| []| 1| 6| null| Good|
| 1| 100| 1| Good|[Good, Meduim, Poor]| [Good]| 2| 5| Good| Good|
| 1| 100| 2| Meduim| [Meduim, Poor]| [Good, Meduim]| 3| 4| Meduim| Meduim|
| 1| 100| 3| null| [Poor]| [Good, Meduim]| 4| 3| null| Poor|
| 1| 100| 4| Poor| [Poor]|[Good, Meduim, Poor]| 5| 2| Poor| Poor|
| 1| 100| 5| null| []|[Good, Meduim, Poor]| 6| 1| null| Poor|
| 1| 200| 6| null| [Poor]| []| 1| 3| null| Poor|
| 1| 200| 7| null| [Poor]| []| 2| 2| null| Poor|
| 1| 200| 8| Poor| [Poor]| [Poor]| 3| 1| Poor| Poor|
| 1| 300| 10| null| [Good]| []| 1| 3| null| Good|
| 1| 300| 11| null| [Good]| []| 2| 2| null| Good|
| 1| 300| 9| Good| [Good]| [Good]| 3| 1| Good| Good|
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
Hope this helps.
[Edit note: updated solution approach after the original question was revised]
import pyspark.sql.functions as f
df = sc.parallelize([
[1, 100, 0, None],
[1, 100, 1, 'Good'],
[1, 100, 2, 'Medium'],
[1, 100, 3, None],
[1, 100, 4, 'Poor'],
[1, 100, 5, 'Medium'],
[1, 200, 6, None],
[1, 200, 7, None],
[1, 200, 8, 'Poor'],
[1, 300, 9, 'Good'],
[1, 300,10, 'Good'],
[1, 300,11, 'Good']
]).toDF(('customer_id', 'tv_channel_id', 'time', 'signal_strength'))
df.show()
#convert to pandas dataframe and fill NA as per the requirement then convert it back to spark dataframe
df1 = df.sort('customer_id', 'tv_channel_id','time').select('customer_id', 'tv_channel_id', 'signal_strength')
p_df = df1.toPandas()
p_df["signal_strength"] = p_df.groupby(["customer_id","tv_channel_id"]).transform(lambda x: x.fillna(method='bfill'))
df2= sqlContext.createDataFrame(p_df).withColumnRenamed("signal_strength","signal_strength_new")
#replace 'signal_strength' column of original dataframe with the column of above pandas dataframe
df=df.withColumn('row_index', f.monotonically_increasing_id())
df2=df2.withColumn('row_index', f.monotonically_increasing_id())
final_df = df.join(df2, on=['customer_id', 'tv_channel_id','row_index']).drop("row_index","signal_strength").\
withColumnRenamed("signal_strength_new","signal_strength").\
sort('customer_id', 'tv_channel_id','time')
final_df.show()
The output is:
+-----------+-------------+----+---------------+
|customer_id|tv_channel_id|time|signal_strength|
+-----------+-------------+----+---------------+
| 1| 100| 0| Good|
| 1| 100| 1| Good|
| 1| 100| 2| Medium|
| 1| 100| 3| Poor|
| 1| 100| 4| Poor|
| 1| 100| 5| Medium|
| 1| 200| 6| Poor|
| 1| 200| 7| Poor|
| 1| 200| 8| Poor|
| 1| 300| 9| Good|
| 1| 300| 10| Good|
| 1| 300| 11| Good|
+-----------+-------------+----+---------------+
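The pandas round-trip above comes down to a single group-wise backward fill. In isolation, and using `Series.bfill()` rather than the older `fillna(method='bfill')` form, the core step looks like this (channel-100 subset only):

```python
import pandas as pd

p = pd.DataFrame({
    "customer_id":     [1] * 6,
    "tv_channel_id":   [100] * 6,
    "time":            range(6),
    "signal_strength": [None, "Good", "Medium", None, "Poor", "Medium"],
})

# backward-fill missing signal strengths within each (customer, channel) group
p["signal_strength"] = (
    p.groupby(["customer_id", "tv_channel_id"])["signal_strength"]
     .transform(lambda s: s.bfill()))
```

Grouping before filling matters: it keeps a fill from leaking across channel boundaries, which a plain `p["signal_strength"].bfill()` would not guarantee.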
What is the grouping key? customer_id + tv_channel_id? And on which field do you define the order (since getting the "first not null" implies an order)?
I have added sample code.
I don't quite understand your expected output, could you provide a simple example? @Manjesh
Do you want the same number of rows per customer_id, or a single row?
I want the same number of rows. I have updated my example; I need the first non-null looking forward within the window. In the updated example for channel 100 you can see I introduced 2 null values with a Good value between them...
Updated my answer, you should now be able to get the desired output.