
Updating a DataFrame schema read in Spark Scala

scala apache-spark dataframe


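I am trying to read a schema from HDFS and apply it when loading my DataFrame. That way the schema can be updated and can live outside the Spark Scala code. What is the best way to do this? Here is what I currently have in my code: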

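import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema is currently hard-coded; every column is a nullable string.
val schema_example = StructType(Array(
    StructField("EXAMPLE_1", StringType, true),
    StructField("EXAMPLE_2", StringType, true),
    StructField("EXAMPLE_3", StringType, true)))

def main(args: Array[String]): Unit = {
   val df_example = get_df("example.txt", schema_example)
}

// Assumes a sqlContext is already in scope.
def get_df(filename: String, schema: StructType): DataFrame = {
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter","~")
      .schema(schema)
      .option("quote", "'")
      .option("quoteMode", "ALL")
      .load(filename)
    // Trim whitespace from every column, keeping the original column names.
    df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
  }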

It would be better to read the schema from a HOCON configuration file, which can be updated as needed:

schema = [
  {
     columnName = EXAMPLE_1
     type = string
  },
  {
     columnName = EXAMPLE_2
     type = string
  },
  {
     columnName = EXAMPLE_3
     type = string
  }
]

You can use ConfigFactory to read this file.

This is a better, cleaner way to maintain the schema for the file.
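As a rough sketch of how the ConfigFactory approach might look, the snippet below turns a `schema` list like the one above into a `StructType`. The object name `SchemaFromConfig`, the helper `toDataType`, the set of supported type names, and the choice to make every column nullable are all assumptions for illustration; they are not part of the original answer. It also assumes the Typesafe Config library and the Hadoop client are on the classpath.

```scala
import java.io.InputStreamReader
import java.nio.charset.StandardCharsets

import com.typesafe.config.{Config, ConfigFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.types._

import scala.collection.JavaConverters._

object SchemaFromConfig {

  // Hypothetical mapping from the `type` strings in the config to Spark types.
  // Extend this as needed; unknown names fail fast.
  private def toDataType(name: String): DataType = name.toLowerCase match {
    case "string"          => StringType
    case "int" | "integer" => IntegerType
    case "long"            => LongType
    case "double"          => DoubleType
    case "boolean"         => BooleanType
    case other             => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  // Build a StructType from a Config that has a top-level `schema` list of
  // objects with `columnName` and `type` keys. All columns are assumed nullable.
  def fromConfig(config: Config): StructType = {
    val fields = config.getConfigList("schema").asScala.map { c =>
      StructField(c.getString("columnName"), toDataType(c.getString("type")), nullable = true)
    }
    StructType(fields.toSeq)
  }

  // Read the HOCON file through the Hadoop FileSystem API, so an hdfs:// path works.
  def fromPath(path: String, hadoopConf: Configuration): StructType = {
    val p = new Path(path)
    val reader = new InputStreamReader(p.getFileSystem(hadoopConf).open(p), StandardCharsets.UTF_8)
    try fromConfig(ConfigFactory.parseReader(reader))
    finally reader.close()
  }
}
```

With something like this in place, the hard-coded `schema_example` could be replaced by a call such as `SchemaFromConfig.fromPath(schemaPath, sc.hadoopConfiguration)`, where `schemaPath` is whatever location you keep the file at and `sc` is the active `SparkContext`. The result is passed to `get_df` unchanged.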

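Have you considered using the Parquet data format? It stores both the schema and the data in the file, and it supports excellent compression and optimization.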
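To illustrate the Parquet suggestion, here is a minimal sketch of a round trip. The object name `ParquetRoundTrip` and the overwrite mode are assumptions for the example. The point is only that the read side needs no `.schema(...)` call, because the schema travels with the data.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetRoundTrip {

  // Write the DataFrame as Parquet; its schema is stored alongside the data.
  // "overwrite" is an assumption here; pick the save mode that fits your job.
  def save(df: DataFrame, path: String): Unit =
    df.write.mode("overwrite").parquet(path)

  // Read it back; the schema is taken from the Parquet files themselves.
  def load(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.parquet(path)
}
```

This only helps if you control how the data is written. If the input must remain a `~`-delimited text file, an external schema definition is still needed for the initial load.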