Scala Spark: Pruning nested columns/fields

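I have a question about whether it is possible to prune nested fields.

I am developing a data source for a high-energy physics data format (ROOT).

Below is the schema of a file, as reported by the data source I am developing: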

 root
 |-- EventAuxiliary: struct (nullable = true)
 |    |-- processHistoryID_: struct (nullable = true)
 |    |    |-- hash_: string (nullable = true)
 |    |-- id_: struct (nullable = true)
 |    |    |-- run_: integer (nullable = true)
 |    |    |-- luminosityBlock_: integer (nullable = true)
 |    |    |-- event_: long (nullable = true)
 |    |-- processGUID_: string (nullable = true)
 |    |-- time_: struct (nullable = true)
 |    |    |-- timeLow_: integer (nullable = true)
 |    |    |-- timeHigh_: integer (nullable = true)
 |    |-- luminosityBlock_: integer (nullable = true)
 |    |-- isRealData_: boolean (nullable = true)
 |    |-- experimentType_: integer (nullable = true)
 |    |-- bunchCrossing_: integer (nullable = true)
 |    |-- orbitNumber_: integer (nullable = true)
 |    |-- storeNumber_: integer (nullable = true)

The data source is here

When the reader is built through FileFormat's buildReader method:

override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

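I see that requiredSchema always contains all the fields/members of the top-level column being accessed. This means that when I select one specific nested field, e.g. select("EventAuxiliary.id_.run_"), requiredSchema is again the full struct of that top-level column ("EventAuxiliary"). I would expect the schema to look like this: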

root
|-- EventAuxiliary: struct...
|  |-- id_: struct ...
|  |    |-- run_: integer

because that is the only schema the select statement needs.
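One way to confirm what the reader is actually asked for is to flatten requiredSchema into dotted leaf paths and log them. The following is a minimal sketch; the object LeafPaths and its method leafPaths are hypothetical helpers for illustration, not part of Spark. It only descends into nested structs and treats everything else (including arrays and maps) as a leaf.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark): lists the leaf fields of a schema.
object LeafPaths {
  // Recursively flatten a StructType into dotted leaf paths,
  // e.g. "EventAuxiliary.id_.run_".
  def leafPaths(schema: StructType, prefix: String = ""): Seq[String] =
    schema.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      f.dataType match {
        case s: StructType => leafPaths(s, path) // descend into nested struct
        case _             => Seq(path)          // anything else counts as a leaf
      }
    }
}
```

Calling LeafPaths.leafPaths(requiredSchema) inside buildReaderWithPartitionValues lists every leaf the reader is expected to produce, which makes it easy to see whether only the selected field or the whole top-level struct was requested.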

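Basically, I want to know how to prune nested fields at the data source level. I had assumed that requiredSchema would contain only the fields named in df.select.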
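To make the goal concrete, here is a rough sketch of what the pruning itself would look like: it trims a StructType down to a set of dotted paths. The object SchemaPruning and its method prune are hypothetical and not part of Spark. The sketch assumes field names contain no dots and does not look inside arrays or maps. How the data source would learn the selected paths is left open, since requiredSchema as described above does not carry that information.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark's API): prune a schema to dotted paths.
object SchemaPruning {
  // Keep only the fields of `schema` that lie on one of the dotted `paths`.
  def prune(schema: StructType, paths: Seq[String]): StructType =
    pruneParts(schema, paths.map(_.split('.').toList))

  private def pruneParts(schema: StructType, paths: Seq[List[String]]): StructType = {
    // Group the remaining path segments by their first component.
    val byHead: Map[String, Seq[List[String]]] =
      paths.collect { case head :: tail => head -> tail }
        .groupBy(_._1)
        .map { case (name, pairs) => name -> pairs.map(_._2) }

    StructType(schema.fields.flatMap { f =>
      byHead.get(f.name).map { tails =>
        f.dataType match {
          // Every path continues below this struct: recurse and keep a subset.
          case s: StructType if !tails.exists(_.isEmpty) =>
            f.copy(dataType = pruneParts(s, tails))
          // A path ends here, or the field is not a struct: keep it whole.
          case _ => f
        }
      }
    })
  }
}
```

For example, SchemaPruning.prune(dataSchema, Seq("EventAuxiliary.id_.run_")) keeps EventAuxiliary and id_ as structs but drops every sibling of id_ and of run_, which matches the pruned schema shown above.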

I tried looking at what avro/parquet do, and found:

Any suggestions or comments would be greatly appreciated.

Thanks

I think this is a very common problem. The MongoDB plugin has already fixed it, as you can see. The elasticsearch plugin has not yet :/ See