Apache Spark: get the first non-null value in a group
In Spark SQL, how can I get the first non-null value (or the first value not matching text such as 'N/A') in a group? In the example below, a user is watching TV channels; the first records for channel 100 have a signal strength of N/A, and the next record's value is Good, so I want to use that one. I tried window functions, but I only found methods such as MAX and MIN. If I use LEAD I only get the next row, and with an unbounded frame I don't see anything like a firstNotNull method. Please advise.

Input:
CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGTH
1           || 100           || 0    || N/A
1           || 100           || 1    || Good
1           || 100           || 2    || Medium
1           || 100           || 3    || N/A
1           || 100           || 4    || Poor
1           || 100           || 5    || Medium
1           || 200           || 6    || N/A
1           || 200           || 7    || N/A
1           || 200           || 8    || Poor
1           || 300           || 9    || Good
1           || 300           || 10   || Good
1           || 300           || 11   || Good
Expected output:

CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGTH
1           || 100           || 0    || Good
1           || 100           || 1    || Good
1           || 100           || 2    || Medium
1           || 100           || 3    || Poor
1           || 100           || 4    || Poor
1           || 100           || 5    || Medium
1           || 200           || 6    || Poor
1           || 200           || 7    || Poor
1           || 200           || 8    || Poor
1           || 300           || 9    || Good
1           || 300           || 10   || Good
1           || 300           || 11   || Good
Actual code:
package com.ganesh.test;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ChannelLoader {
    private static final Logger LOGGER = LoggerFactory.getLogger(ChannelLoader.class);

    public static void main(String[] args) throws AnalysisException {
        String master = "local[*]";
        //region
        SparkSession sparkSession = SparkSession
                .builder()
                .appName(ChannelLoader.class.getName())
                .master(master).getOrCreate();
        SparkContext context = sparkSession.sparkContext();
        context.setLogLevel("ERROR");
        SQLContext sqlCtx = sparkSession.sqlContext();
        Dataset<Row> rawDataset = sparkSession.read()
                .format("com.databricks.spark.csv")
                .option("delimiter", ",")
                .option("header", "true")
                .load("sample_channel.csv");
        rawDataset.printSchema();
        rawDataset.createOrReplaceTempView("channelView");
        //endregion
        WindowSpec windowSpec = Window.partitionBy("CUSTOMER_ID").orderBy("TV_CHANNEL_ID");
        // NOTE: this query fails -- isNan is not a window/aggregate function,
        // so it cannot be used with an OVER clause
        rawDataset = sqlCtx.sql("select * ," +
                " ( isNan(SIGNAL_STRENGHT) over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) ) as updatedStren " +
                " from channelView " +
                " order by CUSTOMER_ID, TV_CHANNEL_ID, TIME "
        );
        rawDataset.show();
        sparkSession.close();
    }
}
Code output (with intermediate columns):
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
|CUSTOMER_ID|TV_CHANNEL_ID|TIME|SIGNAL_STRENGTH| fwdValues| bkwdValues|rank_fwd|rank_bkwd|SIGNAL_STRENGTH|NEW_SIGNAL_STRENGTH|
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
| 1| 100| 0| null|[Good, Meduim, Poor]| []| 1| 6| null| Good|
| 1| 100| 1| Good|[Good, Meduim, Poor]| [Good]| 2| 5| Good| Good|
| 1| 100| 2| Meduim| [Meduim, Poor]| [Good, Meduim]| 3| 4| Meduim| Meduim|
| 1| 100| 3| null| [Poor]| [Good, Meduim]| 4| 3| null| Poor|
| 1| 100| 4| Poor| [Poor]|[Good, Meduim, Poor]| 5| 2| Poor| Poor|
| 1| 100| 5| null| []|[Good, Meduim, Poor]| 6| 1| null| Poor|
| 1| 200| 6| null| [Poor]| []| 1| 3| null| Poor|
| 1| 200| 7| null| [Poor]| []| 2| 2| null| Poor|
| 1| 200| 8| Poor| [Poor]| [Poor]| 3| 1| Poor| Poor|
| 1| 300| 10| null| [Good]| []| 1| 3| null| Good|
| 1| 300| 11| null| [Good]| []| 2| 2| null| Good|
| 1| 300| 9| Good| [Good]| [Good]| 3| 1| Good| Good|
+-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
Hope this helps.
[Edit note: updated solution approach after the original question was revised]
import pyspark.sql.functions as f
df = sc.parallelize([
[1, 100, 0, None],
[1, 100, 1, 'Good'],
[1, 100, 2, 'Medium'],
[1, 100, 3, None],
[1, 100, 4, 'Poor'],
[1, 100, 5, 'Medium'],
[1, 200, 6, None],
[1, 200, 7, None],
[1, 200, 8, 'Poor'],
[1, 300, 9, 'Good'],
[1, 300,10, 'Good'],
[1, 300,11, 'Good']
]).toDF(('customer_id', 'tv_channel_id', 'time', 'signal_strength'))
df.show()
#convert to pandas dataframe and fill NA as per the requirement then convert it back to spark dataframe
df1 = df.sort('customer_id', 'tv_channel_id','time').select('customer_id', 'tv_channel_id', 'signal_strength')
p_df = df1.toPandas()
p_df["signal_strength"] = p_df.groupby(["customer_id","tv_channel_id"]).transform(lambda x: x.fillna(method='bfill'))
df2= sqlContext.createDataFrame(p_df).withColumnRenamed("signal_strength","signal_strength_new")
#replace 'signal_strength' column of original dataframe with the column of above pandas dataframe
df=df.withColumn('row_index', f.monotonically_increasing_id())
df2=df2.withColumn('row_index', f.monotonically_increasing_id())
final_df = df.join(df2, on=['customer_id', 'tv_channel_id','row_index']).drop("row_index","signal_strength").\
withColumnRenamed("signal_strength_new","signal_strength").\
sort('customer_id', 'tv_channel_id','time')
final_df.show()
The output is:
+-----------+-------------+----+---------------+
|customer_id|tv_channel_id|time|signal_strength|
+-----------+-------------+----+---------------+
| 1| 100| 0| Good|
| 1| 100| 1| Good|
| 1| 100| 2| Medium|
| 1| 100| 3| Poor|
| 1| 100| 4| Poor|
| 1| 100| 5| Medium|
| 1| 200| 6| Poor|
| 1| 200| 7| Poor|
| 1| 200| 8| Poor|
| 1| 300| 9| Good|
| 1| 300| 10| Good|
| 1| 300| 11| Good|
+-----------+-------------+----+---------------+
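The pandas round-trip above comes down to a single group-wise backward fill. In isolation, and using `Series.bfill()` rather than the older `fillna(method='bfill')` form, the core step looks like this (channel-100 subset only):

```python
import pandas as pd

p = pd.DataFrame({
    "customer_id":     [1] * 6,
    "tv_channel_id":   [100] * 6,
    "time":            range(6),
    "signal_strength": [None, "Good", "Medium", None, "Poor", "Medium"],
})

# backward-fill missing signal strengths within each (customer, channel) group
p["signal_strength"] = (
    p.groupby(["customer_id", "tv_channel_id"])["signal_strength"]
     .transform(lambda s: s.bfill()))
```

Grouping before filling matters: it keeps a fill from leaking across channel boundaries, which a plain `p["signal_strength"].bfill()` would not guarantee.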
What is the grouping key? customer_id + tv_channel_id? And on which field do you define the order (since getting the "first not null" implies an order)?
I have added sample code.
I don't quite understand your expected output, could you provide a simple example? @Manjesh
Do you want the same number of rows per customer_id, or a single row?
I want the same number of rows. I have updated my example; I need the first non-null looking forward within the window. In the updated example for channel 100 you can see I introduced 2 null values with a Good value between them...
Updated my answer, you should now be able to get the desired output.