Changing the data type of DataFrame columns in Spark Scala

The rule is: columns whose names start with Data-C should be StringType, Data-D columns should be DateType, and Data-N columns should be DoubleType. I have a DataFrame in which every column is currently a string, so I tried to update the data types as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._   // StringType, DateType, DoubleType
import sparkSession.sqlContext.implicits._

val diff_set = Seq("col7", "col8", "col15", "Data-C-col1", "Data-C-col3", "Data-N-col2", "Data-N-col4", "Data-D-col16", "Data-D-col18", "Data-D-col20").toSet
var df = (1 to 10).toDF
df = df.select(df.columns.map(c => col(c).as(c)) ++ diff_set.map(c => lit(null).cast("string").as(c)): _*)
df.printSchema()

// This foreach loop yields slow performance:
// each withColumn call adds another projection to the query plan
df.columns.foreach { x =>
  if (x.startsWith("Data-C")) {
    df = df.withColumn(x, col(x).cast(StringType))
  } else if (x.startsWith("Data-D")) {
    df = df.withColumn(x, col(x).cast(DateType))
  } else if (x.startsWith("Data-N")) {
    df = df.withColumn(x, col(x).cast(DoubleType))
  }
}
df.printSchema()
Is there a more elegant and more efficient (performance-wise) way to do this in Spark Scala?

Check the code below.

scala> df.printSchema
root
 |-- value: integer (nullable = false)
 |-- Data-C-col1: string (nullable = true)
 |-- Data-D-col18: string (nullable = true)
 |-- Data-N-col4: string (nullable = true)
 |-- Data-N-col2: string (nullable = true)
 |-- col15: string (nullable = true)
 |-- Data-D-col16: string (nullable = true)
 |-- Data-D-col20: string (nullable = true)
 |-- col8: string (nullable = true)
 |-- col7: string (nullable = true)
 |-- Data-C-col3: string (nullable = true)


Would the solution below work?
val column_datatype_mapping = Map(
  "Data-C" -> "string",
  "Data-D" -> "date",
  "Data-N" -> "double"
)

val columns = df.columns.map { c =>
  // Derive the prefix key, e.g. "Data-C-col1" -> "Data-C";
  // names without a dash, like "col7", yield "" and fall through unchanged
  val key = c.split("-").init.mkString("-")
  if (column_datatype_mapping.contains(key))
    col(c).cast(column_datatype_mapping(key))
  else
    col(c)
}
scala> df.select(columns:_*).printSchema
root
 |-- value: integer (nullable = false)
 |-- Data-C-col1: string (nullable = true)
 |-- Data-D-col18: date (nullable = true)
 |-- Data-N-col4: double (nullable = true)
 |-- Data-N-col2: double (nullable = true)
 |-- col15: string (nullable = true)
 |-- Data-D-col16: date (nullable = true)
 |-- Data-D-col20: date (nullable = true)
 |-- col8: string (nullable = true)
 |-- col7: string (nullable = true)
 |-- Data-C-col3: string (nullable = true)
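
The reason this is faster than the loop in the question: a single `select` expresses all the casts as one projection, whereas each `withColumn` call adds a separate node to the logical plan that Catalyst must analyze and collapse. A minimal sketch of keeping the typed result, assuming the `df` and `columns` values defined above:

```scala
// One select applies every cast in a single projection,
// instead of one plan node per withColumn call.
val typed = df.select(columns: _*)
typed.printSchema()
```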