Apache Spark: how to get the pivoted column names from a DataFrame?

Tags: apache-spark, pyspark, apache-spark-sql

I pivot on a column, and it generates multiple new columns.

I want to take those columns and pack them under a single field.

The code below gives the result I want, but I'm manually selecting col("search"), col("main"), col("theme"), and I'm wondering whether there is a way to select all of these columns dynamically (the pivoted columns, so to speak).


Actually, I think I found the answer, though I don't know how performant it is.

I got the hint from your answer:

 import pandas as pd

 from pyspark.sql import Row
 from pyspark.sql.functions import avg, col, count, lit, struct

 # I'm going to pivot on the 2nd column ('origin')
 mylist = [
     [1, 'search', 3, 1],
     [1, 'search', 3, 2],
     [1, 'main', 5, 3],
     [1, 'main', 6, 4],

     [2, 'search', 4, 10],
     [2, 'search', 4, 11],
     [2, 'main', 6, 12],
     [2, 'main', 6, 13],
     [2, 'theme', 6, 14],

     [3, 'search', 4, 5],
     [3, 'main', 6, 6],
     [3, 'main', 6, 7],
     [3, 'theme', 6, 8],
 ]

 df = pd.DataFrame(mylist, columns=['id', 'origin', 'time', 'screen_index'])

 mylist = df.to_dict('records')
 spark_session = get_spark_session()  # helper assumed to return an active SparkSession

 df = spark_session.createDataFrame(Row(**x) for x in mylist)

 df_wanted = df.groupBy("id").pivot('origin').agg(
     struct(count(lit(1)).alias('count'), avg("time").alias('avg_time'))
 ).withColumn(
     #### here I'm manually selecting columns, but I want to grab them dynamically because I don't know beforehand what they're going to be.
     "origin_info", struct(col("search"), col("main"), col("theme")) 
 ).select("id", "origin_info")


 df_wanted.printSchema()
 root
  |-- id: long (nullable = true)
  |-- origin_info: struct (nullable = false)
  |    |-- search: struct (nullable = false)
  |    |    |-- count: long (nullable = false)
  |    |    |-- avg_time: double (nullable = true)
  |    |-- main: struct (nullable = false)
  |    |    |-- count: long (nullable = false)
  |    |    |-- avg_time: double (nullable = true)
  |    |-- theme: struct (nullable = false)
  |    |    |-- count: long (nullable = false)
  |    |    |-- avg_time: double (nullable = true)
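
As an aside, once the pivoted structs are packed under origin_info, the nested fields can be read back with dotted column paths; a minimal sketch against the schema printed above:

 # Nested struct fields are addressable with dotted paths.
 df_wanted.select("id", "origin_info.search.count", "origin_info.main.avg_time").show()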
The trick is to run the pivot first, without the manual struct, and then read the pivoted column names from the schema: every column except the groupBy key "id" is a pivoted column.

df_pivoted = df.groupBy("id").pivot("origin").agg(
    struct(count(lit(1)).alias("count"), avg("time").alias("avg_time"))
)

# Every column except the groupBy key is a pivoted column.
names = df_pivoted.schema.names.copy()
names.remove("id")

columns = [col(name) for name in names]

df_wanted = df_pivoted.withColumn(
    "origin_info", struct(*columns)
).select("id", "origin_info")
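
On the performance point: when pivot() is not given the values up front, Spark runs an extra job just to compute the distinct values of the pivot column. A common pattern is to collect those values yourself and pass them to pivot() explicitly, which also hands you the column names for the struct. A sketch of that idea (the intermediate name origins is mine):

# Collect the distinct pivot values once; passing them to pivot()
# lets Spark skip its internal distinct-value scan.
origins = [row["origin"] for row in df.select("origin").distinct().collect()]

df_pivoted = df.groupBy("id").pivot("origin", origins).agg(
    struct(count(lit(1)).alias("count"), avg("time").alias("avg_time"))
)

# 'origins' doubles as the list of pivoted column names.
df_wanted = df_pivoted.withColumn(
    "origin_info", struct(*[col(name) for name in origins])
).select("id", "origin_info")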