Python: dynamically rename DataFrame columns using PySpark

I am reading a file where a column can be a struct when it has a value, or a string when there is no data. In the example below, assigned_to and group are structs and contain data:

root
 |-- number: string (nullable = true)
 |-- assigned_to: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)
 |-- group: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)
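For reference, a DataFrame with this shape can be reproduced with hypothetical sample data (the row values and link URLs below are made up):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# schema matching the printSchema() output above
schema = StructType([
    StructField("number", StringType()),
    StructField("assigned_to", StructType([
        StructField("display_value", StringType()),
        StructField("link", StringType()),
    ])),
    StructField("group", StructType([
        StructField("display_value", StringType()),
        StructField("link", StringType()),
    ])),
])

# one hypothetical row; struct fields are passed as tuples
df_sample = spark.createDataFrame(
    [("INC001", ("Alice", "https://example/assigned"), ("Support", "https://example/group"))],
    schema,
)
df_sample.printSchema()
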
To flatten the JSON, I do the following:

from pyspark.sql.functions import lit

df23 = spark.read.parquet("dbfs:***/test1.parquet")
val_cols4 = []

# the idea: when a column's data type is struct, dynamically extract its display_value;
# otherwise create a new column and default it to None
for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(name + ".display_value")
  else:
    df23 = df23.withColumn(name + "_value", lit(None))

Now if I select from dataframe df23 using val_cols4, all the struct columns come back with the same name, display_value.
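
To illustrate with the example schema above, selecting a nested field by its string path keeps only the leaf field name:

df23.select(val_cols4).printSchema()
# root
#  |-- display_value: string (nullable = true)
#  |-- display_value: string (nullable = true)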

How can I rename the columns to the proper values? I tried the following:

df23 = spark.read.parquet("dbfs:***/test1.parquet")
val_cols4 = []

# same idea, but build the "col(...).alias(...)" expression as a string
for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append("col('" + name + ".display_value').alias('" + name + "_value')")
  else:
    df23 = df23.withColumn(name + "_value", lit(None))

This does not work; I get an error when I do a select on the dataframe.

You can append an aliased Column object to val_cols4 instead of a string. When select is given a plain string it treats the whole string as a column name, so an expression like "col('assigned_to.display_value').alias('assigned_to_value')" cannot be resolved:

from pyspark.sql.functions import col, lit

val_cols4 = []

# when the column is a struct, append a Column object that is already aliased
# to the desired output name (e.g. assigned_to_value); otherwise keep the
# original approach of adding a null column
for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(col(name + ".display_value").alias(name + "_value"))
  else:
    df23 = df23.withColumn(name + "_value", lit(None))

Then you can select the columns, for example:

newdf = df23.select(val_cols4)
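
With the example schema above, the flattened result should look roughly like this:

newdf.printSchema()
# root
#  |-- assigned_to_value: string (nullable = true)
#  |-- group_value: string (nullable = true)

Note that val_cols4 only receives the struct-derived columns, so the _value columns created with lit(None) for non-struct columns are not part of this select; add them to the list as well if you need them in newdf.

If you would rather keep building strings, a sketch of an equivalent approach (applied to the dataframe as originally read, before any withColumn calls) is to build SQL expression strings and pass them to selectExpr, which parses expressions instead of plain column names:

expr_cols = []
for name, cols in df23.dtypes:
  if 'struct' in cols:
    # e.g. "assigned_to.display_value AS assigned_to_value"
    expr_cols.append(name + ".display_value AS " + name + "_value")
  else:
    # mirror lit(None) with a SQL NULL
    expr_cols.append("NULL AS " + name + "_value")

newdf2 = df23.selectExpr(*expr_cols)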