
Apache Spark: reading ORC with specific columns

I have an ORC file, and when I read it as shown below, all of its columns are loaded:

val df = spark.read.orc("/some/path/")

df.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- value: string (nullable = true)
 |-- all: string (nullable = true)
 |-- next: string (nullable = true)
 |-- action: string (nullable = true)
But I only want to read two columns from this file. Is there a way to read only two columns (id, name) when loading the ORC file?

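Is there a way to read only two columns (id, name) when loading the ORC file?

Yes, a subsequent select is all you need. Spark takes care of the rest: ORC is a columnar format, and the optimizer pushes the projection down to the reader, so only the selected columns are read from disk.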

val df = spark.read.orc("/some/path/").select("id", "name")
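One way to check that the select really limits what is read is to look at the physical plan. The following is a minimal sketch, assuming Spark is on the classpath; the temporary directory, the sample rows, and the name `pruned` are hypothetical and only there to make the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for this sketch only (assumption: no cluster is needed).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orc-column-pruning")
  .getOrCreate()

// Hypothetical sample data, written as ORC into a fresh temporary location.
val path = Files.createTempDirectory("orc-demo").resolve("data").toString
spark.range(3)
  .selectExpr("cast(id as string) as id", "'some name' as name", "'some value' as value")
  .write.orc(path)

// Read the file and keep only the two columns of interest.
val pruned = spark.read.orc(path).select("id", "name")

// Print the physical plan; the ReadSchema entry of the file scan
// should list only the selected columns.
pruned.explain()
```

If the plan's ReadSchema still listed every column, that would suggest the projection was not pushed down to the reader.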

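Spark has a lazy execution model, so you can write any data transformations in your code without them taking effect immediately. Spark only starts doing the work once an action is called, and it is smart enough not to do extra work. So you can write it like this: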

import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._

val inDF: DataFrame = spark.read.orc("/some/path/")

val filteredDF: DataFrame = inDF.select($"id", $"name")

// any additional transformations

// real work starts after this action
val result: Array[Row] = filteredDF.collect()
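The lazy behaviour described above can be made visible with an accumulator that counts how many rows a transformation has actually processed. This is only an illustrative sketch, assuming a local Spark session; the names `touched`, `mark`, and `transformed` are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for this sketch only (assumption: no cluster is needed).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-evaluation")
  .getOrCreate()

// Counts the rows that have passed through the UDF below.
val touched = spark.sparkContext.longAccumulator("rows touched")

// A UDF that returns its input unchanged but records every call.
val mark = udf((id: Long) => { touched.add(1L); id })

// Transformation only: this builds a plan and processes no rows yet.
val transformed = spark.range(5).select(mark(col("id")).as("id"))
val before: Long = touched.value // still zero, nothing has run

// Action: this is the point where Spark executes the plan.
val rows = transformed.collect()
val after: Long = touched.value // greater than zero, the UDF has now run

println(s"rows touched before action: $before, after action: $after")
```

Accumulators updated inside transformations can be counted more than once if a task is retried, so this is a demonstration device rather than a reliable metric.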