Python 3.x: Dynamically creating columns in a PySpark SQL select statement


python-3.x pyspark


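I have a PySpark DataFrame named unique_attributes. It contains the columns productname, productbrand, producttype, weight, and id. I am partitioning by some of these columns and using a window function to get the first value of the id column. I would like to pass the list of partition columns dynamically. For example, if I want to add the weight column to the partition, I would like to pass a list rather than write another col('weight') in the select. Does anyone have a suggestion for how to do this? An example follows.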

Current code:

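from pyspark.sql import Window
from pyspark.sql.functions import col, first

# Partition by the product attributes and order each partition by id
w2 = Window.partitionBy(['productname',
                         'productbrand',
                         'producttype']).orderBy(unique_attributes.id.asc())

# first("id", True) takes the first non-null id in each partition
first_item_id_df = unique_attributes \
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            first("id", True).over(w2).alias('matchid')).distinct()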

Desired dynamic code:

column_list=['productname', 
             'productbrand', 
             'producttype',
             'weight']

w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())

# somehow creates

first_item_id_df = unique_attributes \
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            col('weight'),
            first("id", True).over(w2).alias('matchid')).distinct()

Just use

*column_list

@mck Thanks for getting back to me so quickly. Do you mean: first_item_id_df = unique_attributes.select(*column_list, first("id", True).over(w2).alias('matchid')).distinct()