
Scala Spark: pruning nested columns/fields


I have a question about the possibility of pruning nested fields.

I'm developing a data source for a high-energy physics data format (ROOT).

Below is the schema of a file read with the data source I'm developing:

 root
 |-- EventAuxiliary: struct (nullable = true)
 |    |-- processHistoryID_: struct (nullable = true)
 |    |    |-- hash_: string (nullable = true)
 |    |-- id_: struct (nullable = true)
 |    |    |-- run_: integer (nullable = true)
 |    |    |-- luminosityBlock_: integer (nullable = true)
 |    |    |-- event_: long (nullable = true)
 |    |-- processGUID_: string (nullable = true)
 |    |-- time_: struct (nullable = true)
 |    |    |-- timeLow_: integer (nullable = true)
 |    |    |-- timeHigh_: integer (nullable = true)
 |    |-- luminosityBlock_: integer (nullable = true)
 |    |-- isRealData_: boolean (nullable = true)
 |    |-- experimentType_: integer (nullable = true)
 |    |-- bunchCrossing_: integer (nullable = true)
 |    |-- orbitNumber_: integer (nullable = true)
 |    |-- storeNumber_: integer (nullable = true)
The data source is here.

When building a reader with FileFormat's buildReaderWithPartitionValues method:

override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
I see that requiredSchema always contains all the fields/members of the top-level column being accessed. That means that when I want to select a specific nested field, e.g. select("EventAuxiliary.id_.run_"), requiredSchema will again be the full struct of that top-level column ("EventAuxiliary"). I would expect the schema to look like this instead:

root
|-- EventAuxiliary: struct...
|  |-- id_: struct ...
|  |    |-- run_: integer
because that is the only schema the select statement needs.
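For illustration (my own sketch, not from the original post; the short format name "root" and the file path are assumptions), the behaviour can be observed like this:

// Hypothetical repro: load through the custom source and select one nested field
val df = spark.read.format("root").load("/path/to/file.root")
val pruned = df.select("EventAuxiliary.id_.run_")
// The ReadSchema shown in the physical plan still lists the full
// EventAuxiliary struct, not just EventAuxiliary.id_.run_
pruned.explain(true)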

Basically, I want to know how to prune nested fields at the data source level. I would expect requiredSchema to contain only the fields coming from df.select.
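As a starting point, here is a minimal sketch (my own, not from the question) of how such pruning could be done by hand: intersect the file's schema with a set of dotted paths such as "EventAuxiliary.id_.run_". It ignores arrays, maps and quoted field names, so it only illustrates the idea:

import org.apache.spark.sql.types._

// Keep only the fields of `schema` reachable through one of the dotted
// `paths`; a path that ends at a struct keeps that whole subtree.
def pruneSchema(schema: StructType, paths: Seq[String]): StructType = {
  def prune(st: StructType, ps: Seq[Seq[String]]): StructType = {
    val byHead = ps.filter(_.nonEmpty).groupBy(_.head)
    StructType(st.fields.flatMap { field =>
      byHead.get(field.name).map { matched =>
        val tails = matched.map(_.tail).filter(_.nonEmpty)
        field.dataType match {
          case nested: StructType if tails.nonEmpty =>
            field.copy(dataType = prune(nested, tails))
          case _ => field // leaf reached, or whole subtree requested
        }
      }
    })
  }
  prune(schema, paths.map(_.split('.').toSeq))
}

// pruneSchema(dataSchema, Seq("EventAuxiliary.id_.run_")) yields:
// root
//  |-- EventAuxiliary: struct
//  |    |-- id_: struct
//  |    |    |-- run_: integer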

I'm trying to look at what avro/parquet are doing, and found:
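As a pointer (my addition, not from the post): in Spark 2.4+ the built-in Parquet source performs this kind of nested-field pruning through an optimizer rule (ParquetSchemaPruning, later generalized to SchemaPruning), gated behind a configuration flag:

// Off by default in Spark 2.4, on by default from Spark 3.0
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")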

Any suggestions/comments would be appreciated.

Thanks,


VK

I think this is a very common problem. The MongoDB connector has already fixed it, as you can see. The elasticsearch connector hasn't yet :/ See