Google BigQuery: structure of nested arrays in BigQuery with AVRO or Parquet


I am trying to load Parquet data into Google BigQuery, to take advantage of the efficient columnar format and (I hope) to work around BigQuery's lack of support for logical types (dates, etc.) in AVRO files.

My data contains two levels of nested arrays.

Using JSON I can create and load a table with the desired structure:

bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval

   Last modified                    Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:21:56   |- file_name: string (required)          3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (repeated)
                    |  |- interval_date: date (required)
                    |  |- quality: string (required)
                    |  +- values: record (repeated)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)
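
The schema file and the JSON data themselves are not shown in the thread. Reconstructed from the bq show output above, simple_interval_bigquery_schema.json would look roughly like the following sketch (the REQUIRED/REPEATED modes are inferred from that output):

[
  {"name": "file_name", "type": "STRING", "mode": "REQUIRED"},
  {"name": "file_created", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "interval_length", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "days", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "interval_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "quality", "type": "STRING", "mode": "REQUIRED"},
    {"name": "values", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "interval", "type": "INTEGER", "mode": "REQUIRED"},
      {"name": "value", "type": "FLOAT", "mode": "REQUIRED"}
    ]}
  ]}
]

A row of the newline-delimited simple_interval.json could then look like this (the values are purely illustrative):

{"file_name": "example.csv", "file_created": "2018-05-09T00:00:00Z", "id": "meter-1", "interval_length": 30, "days": [{"interval_date": "2018-05-08", "quality": "GOOD", "values": [{"interval": 1, "value": 1.0}]}]}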
I have tried to create the same structure from a Parquet data file, written with AvroParquetWriter. My Avro schema is:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {
      "type": "array",
      "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {
            "type": "array",
            "items": {
              "name": "values_record",
              "type": "record",
              "fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
From the Avro specification, and from what I have found online, it seems necessary to nest the "record" nodes inside the "array" nodes in this way.
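
The writer code is not included in the question. The following is a minimal sketch of how such a file might be written with AvroParquetWriter using the Avro GenericRecord API; the file names, sample values and the use of GenericData are assumptions, not taken from the thread:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.File;
import java.util.Arrays;

public class SimpleIntervalWriter {
    public static void main(String[] args) throws Exception {
        // Parse the Avro schema above from a local .avsc file (path assumed).
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));
        Schema daysSchema = schema.getField("days").schema().getElementType();
        Schema valuesSchema = daysSchema.getField("values").schema().getElementType();

        // Build one nested record; the values are illustrative only.
        GenericRecord value = new GenericData.Record(valuesSchema);
        value.put("interval", 1);
        value.put("value", 1.0f);

        GenericRecord day = new GenericData.Record(daysSchema);
        day.put("interval_date", 17660);          // date logical type: days since epoch
        day.put("quality", "GOOD");
        day.put("values", Arrays.asList(value));

        GenericRecord row = new GenericData.Record(schema);
        row.put("file_name", "example.csv");
        row.put("file_created", 1525824000000L);  // timestamp-millis logical type
        row.put("id", "meter-1");
        row.put("interval_length", 30);
        row.put("days", Arrays.asList(day));

        // Write the record to a Parquet file using the Avro schema.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("simple_interval.parquet"))
                .withSchema(schema)
                .build()) {
            writer.write(row);
        }
    }
}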

When I create the Parquet file, parquet-tools reports its schema as:

message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}
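
The repeated group array levels are parquet-avro's legacy two-level list encoding. For comparison, the three-level encoding described by the Parquet LIST specification would look roughly as follows (written by hand here, not produced by the tools used in this thread; whether BigQuery would import that form as plain repeated fields is not established in this thread):

message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group list {
      required group element {
        required int32 interval_date (DATE);
        required binary quality (UTF8);
        required group values (LIST) {
          repeated group list {
            required group element {
              required int32 interval;
              required float value;
            }
          }
        }
      }
    }
  }
}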
I loaded the file into BigQuery and inspected the result:

bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval

   Last modified                      Schema                      Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:05:54   |- file_name: string (required)               3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  +- array: record (repeated)           <-- extra column
                    |  |  |- interval_date: date (required)
                    |  |  |- quality: string (required)
                    |  |  +- values: record (required)
                    |  |  |  +- array: record (repeated)     <-- extra column
                    |  |  |  |  |- interval: integer (required)
                    |  |  |  |  |- value: float (required)

Answer: I used this Avro schema:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {"type":"record","name":"days_", "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {"type":"record", "name":"values_","fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
          ]}}
    ]}}
  ]
}
I created an empty Avro file with it and ran the following command:

bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro 
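
The thread does not show how the empty Avro file was produced. One way to create a zero-record Avro container file from the schema above, using the Avro Java library, is sketched here (class and file names are assumptions):

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class EmptyAvroFile {
    public static void main(String[] args) throws Exception {
        // Parse the answer's schema from a local .avsc file (path assumed).
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));

        // Create a container file that carries the schema in its header but
        // contains no records; the thread's result below shows 0 rows loaded.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("simple_interval.avro"));
        }
    }
}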

bq show then reports the loaded table as:

   Last modified                    Schema                    Total Rows   Total Bytes   Expiration   Time Partitioning   Labels   kmsKeyName
 ----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------
  22 May 09:46:02   |- file_name: string (required)           0            0
                    |- file_created: integer (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  |- interval_date: integer (required)
                    |  |- quality: string (required)
                    |  +- values: record (required)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)

Comments:

- Don't know about Parquet, but FYI, logical types in Avro will be supported in BQ soon.
- Thanks for the pointer about Avro logical types, I look forward to it. However, given its columnar nature, Parquet still seems a better fit for BigQuery. For the simple test data I generated (lots of repetition within columns, which is typical of my data), Avro is about 25% of the JSON size, while Parquet is less than 1% of the JSON size! I have also noticed another fact about this issue, which seems to indicate it is specific to Parquet, or perhaps to AvroParquetWriter: when I write an Avro file with this Avro schema and serialise it to JSON with avro-tools, I get the simple JSON I want (no "array" nodes). When I write a Parquet file with AvroParquetWriter and the same Avro schema, and serialise it back to JSON with parquet-tools, I get the extra "array" nodes. Perhaps there is an option in AvroParquetWriter that controls this; I will investigate.
- If you manage to solve your problem, could you post an answer and accept it?
- Victor, I want repeated records. Your example has them as nested records, but not repeated. BigQuery schemas seem to allow a field to be a repeated record, whereas in Avro a field can be an array or a record but not both at once, hence the extra level of nesting. Also note that, for me, loading from Avro files is currently not possible, because BigQuery does not yet support logical types such as dates (I will retry when that support is available).
- @JohnHurst There is a feature request to support Avro timestamp, date and time types (). You can star it to track its progress.
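
Regarding the comment above about a possible AvroParquetWriter option: parquet-avro has a configuration flag, parquet.avro.write-old-list-structure, which, when set to false, switches list encoding from the legacy two-level "repeated group array" form to the three-level list/element form. A sketch of how it could be applied to the writer from the earlier sketch follows; whether BigQuery then imports the lists as plain repeated fields is not confirmed in this thread:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.IOException;

public class SimpleIntervalWriterNewLists {
    public static ParquetWriter<GenericRecord> openWriter(Schema schema) throws IOException {
        Configuration conf = new Configuration();
        // false = write the three-level LIST encoding instead of the legacy
        // two-level form reported by parquet-tools above.
        conf.setBoolean("parquet.avro.write-old-list-structure", false);
        return AvroParquetWriter
                .<GenericRecord>builder(new Path("simple_interval.parquet"))
                .withSchema(schema)
                .withConf(conf)
                .build();
    }
}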