Renaming column names of a DataFrame in Spark Scala

scala apache-spark dataframe

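I am trying to convert all the headers / column names of a DataFrame in Spark Scala. So far I have come up with the following code, which only replaces a single column name: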

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

If the structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)

If you want to rename individual columns, you can use either select with alias:
df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:
import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

// Columns missing from lookup keep their current name
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:
df.withColumnRenamed("_1", "x1")

which can be used with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
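The foldLeft pattern above can be tried without a Spark session. The following is a minimal sketch in plain Scala, assuming a Vector[String] of column names as a stand-in for the DataFrame; the object FoldLeftRenameSketch and its two helpers are hypothetical names invented for illustration, not Spark API. It shows how each step receives the result of the previous rename as the accumulator.

```scala
object FoldLeftRenameSketch {
  // Hypothetical stand-in for DataFrame.withColumnRenamed: returns a NEW
  // vector of names and leaves the input untouched. Renaming a name that
  // is not present changes nothing.
  def withColumnRenamed(columns: Vector[String], existing: String, newName: String): Vector[String] =
    columns.map(c => if (c == existing) newName else c)

  // Thread the accumulator through every (old -> new) pair of the lookup.
  def renameAll(columns: Vector[String], lookup: Map[String, String]): Vector[String] =
    lookup.foldLeft(columns) { case (acc, (oldName, newName)) =>
      withColumnRenamed(acc, oldName, newName)
    }

  def main(args: Array[String]): Unit = {
    val columns = Vector("_1", "_2", "_3", "_4")
    val lookup = Map("_1" -> "foo", "_3" -> "bar")
    println(renameAll(columns, lookup).mkString(", ")) // prints: foo, _2, bar, _4
  }
}
```

Because every step returns a new value instead of mutating the old one, the result of the fold has to be kept; discarding it leaves the original names unchanged.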

For nested structures (structs), one possible option is renaming by selecting the whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that this may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

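For those of you interested in the PySpark version (it is actually the same in Scala; see the comment below):
merchants_df_renamed = merchants_df.toDF(
    'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()

Result:
root
 |-- merchant_id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- merchant: string (nullable = true)
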
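In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resulting table. It sure would be nice if there were a similar way to do this in "normal" SQL.

Suppose the DataFrame df has 3 columns id1, name1, price1 and you wish to rename them to id2, name2, price2: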
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I found this approach useful in many cases.

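Joining two tables without renaming the join key:
// Method 1: create a new DataFrame (day1 must be declared as a var to be reassigned)
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// Method 2: use withColumnRenamed on every column except the key
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
  day1 = day1.withColumnRenamed(x, y)
}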

Works!
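The renaming rule in the join example above (leave the join key alone, add a suffix to every other column) can be checked in isolation. This is a minimal sketch in plain Scala, assuming the column names are available as a Seq[String]; SuffixSketch and suffixExceptKey are hypothetical names introduced here, and the sample column names are made up.

```scala
object SuffixSketch {
  // Append `suffix` to every column name except the join key.
  def suffixExceptKey(columns: Seq[String], key: String, suffix: String): Seq[String] =
    columns.map(c => if (c == key) c else s"$c$suffix")

  def main(args: Array[String]): Unit = {
    val day1Columns = Seq("id", "price", "qty") // made-up sample names
    println(suffixExceptKey(day1Columns, "id", "_d1").mkString(", ")) // prints: id, price_d1, qty_d1
  }
}
```

The resulting sequence is what would be passed to toDF with `: _*`, so the key column still matches on both sides of the join while the remaining columns stay distinguishable.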
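
Hi @zero323, when using withColumnRenamed I am getting an AnalysisException: cannot resolve 'CC8.1' given input columns... It fails even though CC8.1 is available in the DataFrame. Please guide.

@u449355 It is not clear to me whether this is a nested column or one containing dots. In the latter case backticks should work (at least in some basic cases).

What does : _* mean in the select call?

To answer Anton Kim's question: the : _* is the so-called Scala "splat" operator. It basically explodes an array-like thing into an uncontained list, which is useful when you want to pass the array to a function that takes an arbitrary number of arguments but does not have a version that takes a List[]. If you are at all familiar with Perl, it is the difference between some_function(@my_array) # "splatted" and some_function(\@my_array) # not splatted ... in Perl the backslash "\" operator returns a reference to a thing.

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*) .. could you please decompose it? Especially the lookup.getOrElse(c, c) part.

You must be careful when renaming columns in a DataFrame with toDF(). This method works much slower than the others. I have a DataFrame containing 100M records, and a simple count query over it takes ~3s, whereas the same query with the toDF() method takes ~16s. But when I rename using the select col AS col_new method, I get ~3s again. More than 5 times faster! Spark 2.3.2.3

You will love it, beautiful and elegant.
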
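Two pieces asked about in the comments above, the `: _*` splat and `lookup.getOrElse(c, c)`, can be tried in plain Scala without Spark. This is a minimal sketch; SplatSketch and its `select` function are hypothetical stand-ins for a varargs method such as the DataFrame select, and only build a string.

```scala
object SplatSketch {
  // Hypothetical varargs function, standing in for a method declared as select(cols: Column*).
  def select(cols: String*): String = cols.mkString("SELECT ", ", ", "")

  // getOrElse(c, c): use the replacement from the lookup if there is one,
  // otherwise fall back to the original name c.
  def renamed(columns: Seq[String], lookup: Map[String, String]): Seq[String] =
    columns.map(c => lookup.getOrElse(c, c))

  def main(args: Array[String]): Unit = {
    val lookup = Map("_1" -> "foo", "_3" -> "bar")
    val names = renamed(Seq("_1", "_2", "_3", "_4"), lookup)
    // `names: _*` expands the Seq into individual varargs arguments.
    // Passing `names` directly would not compile, because select expects String*.
    println(select(names: _*)) // prints: SELECT foo, _2, bar, _4
  }
}
```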
Sometimes a SQL Server or MySQL table has column names in the following format:

Example: Account Number, customer number

But Hive tables do not support column names containing spaces, so use the solution below to rename the old column names.

Solution:

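// Replace spaces with underscores and lowercase every column name
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
// Bind the result to a new val; df cannot be reassigned unless it was declared as a var
val renamedDf = df.select(renamedColumns: _*)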
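
One thing worth checking before applying a normalization like the one above is whether two different original names collapse into the same new name. This is a minimal sketch in plain Scala, assuming the column names are available as a Seq[String]; NormalizeSketch, normalize and collisions are hypothetical helper names, and the sample names are made up.

```scala
object NormalizeSketch {
  // Same rule as in the solution: spaces become underscores, then lowercase.
  def normalize(name: String): String = name.replaceAll(" ", "_").toLowerCase

  // Original names that would clash after normalization, keyed by the normalized name.
  def collisions(columns: Seq[String]): Map[String, Seq[String]] =
    columns.groupBy(normalize).filter { case (_, originals) => originals.size > 1 }

  def main(args: Array[String]): Unit = {
    val columns = Seq("Account Number", "account_number", "customer number")
    println(columns.map(normalize).mkString(", "))
    println(collisions(columns)) // non-empty here: two names map to account_number
  }
}
```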