BigQuery structure for nested arrays with Avro or Parquet
I am trying to load Parquet data into Google BigQuery, to take advantage of the efficient columnar format and (I hope) to work around BigQuery's lack of support for logical types (dates etc.) in Avro files. My data contains two levels of nested arrays.

Using JSON, I can create and load a table with the desired structure:
bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels
----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
09 May 13:21:56 |- file_name: string (required) 3 246
|- file_created: timestamp (required)
|- id: string (required)
|- interval_length: integer (required)
+- days: record (repeated)
| |- interval_date: date (required)
| |- quality: string (required)
| +- values: record (repeated)
| | |- interval: integer (required)
| | |- value: float (required)
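For reference, one row of the newline-delimited JSON that loads into this schema can be sketched as follows. The field values here are illustrative only (they are not from the original data); the point is the two-level nesting of `days` and `values`:

```python
import json

# One NDJSON row mirroring the BigQuery schema above: "days" is a repeated
# record, and each day in turn holds a repeated "values" record.
row = {
    "file_name": "simple_interval.json",       # illustrative values only
    "file_created": "2018-05-09T13:21:56Z",
    "id": "meter-001",
    "interval_length": 30,
    "days": [
        {
            "interval_date": "2018-05-01",
            "quality": "A",
            "values": [
                {"interval": 1, "value": 1.5},
                {"interval": 2, "value": 2.5},
            ],
        }
    ],
}

# Each line of the load file is one such JSON object.
line = json.dumps(row)
print(line)
```

Note that in the JSON representation a repeated record is simply an array of objects; there is no separate wrapper node, which is what the Parquet load below ends up adding.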
I have tried to create the same structure in a Parquet data file, written with AvroParquetWriter. My Avro schema is:
{
"name": "simple_interval",
"type": "record",
"fields": [
{"name": "file_name", "type": "string"},
{"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
{"name": "id", "type": "string"},
{"name": "interval_length", "type": "int"},
{"name": "days", "type": {
"type": "array",
"items": {
"name": "days_record",
"type": "record",
"fields": [
{"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
{"name": "quality", "type": "string"},
{"name": "values", "type": {
"type": "array",
"items": {
"name": "values_record",
"type": "record",
"fields": [
{"name": "interval", "type": "int"},
{"name": "value", "type": "float"}
]
}
}}
]
}
}}
]
}
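The nesting that the Avro spec requires can be seen by parsing the `days` field from the schema above (stdlib only, field list abbreviated): the repetition comes from the `array` type, and the structure from the `record` items, so every repeated record is necessarily two schema nodes rather than one:

```python
import json

# The "days" field from the schema above: an array whose items are a record.
days_field = json.loads("""
{"name": "days", "type": {
    "type": "array",
    "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
            {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
            {"name": "quality", "type": "string"}
        ]
    }
}}
""")

# Avro has no single "repeated record" field kind: the field's type is an
# array node, and the record sits one level down as the array's items.
assert days_field["type"]["type"] == "array"
assert days_field["type"]["items"]["type"] == "record"
```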
From the Avro specification, and from what I have found online, it appears to be necessary to nest the "record" node inside the "array" node in this way.

When I create the Parquet file, parquet-tools reports the schema as:
message simple_interval {
required binary file_name (UTF8);
required int64 file_created (TIMESTAMP_MILLIS);
required binary id (UTF8);
required int32 interval_length;
required group days (LIST) {
repeated group array {
required int32 interval_date (DATE);
required binary quality (UTF8);
required group values (LIST) {
repeated group array {
required int32 interval;
required float value;
}
}
}
}
}
I loaded the file into BigQuery and inspected the result:
bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels
----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
09 May 13:05:54 |- file_name: string (required) 3 246
|- file_created: timestamp (required)
|- id: string (required)
|- interval_length: integer (required)
+- days: record (required)
| +- array: record (repeated) <-- extra column
| | |- interval_date: date (required)
| | |- quality: string (required)
| | +- values: record (required)
| | | +- array: record (repeated) <-- extra column
| | | | |- interval: integer (required)
| | | | |- value: float (required)
How can I avoid these extra "array" columns and obtain the same structure that the JSON load produces?

I used this Avro schema:
{
"name": "simple_interval",
"type": "record",
"fields": [
{"name": "file_name", "type": "string"},
{"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
{"name": "id", "type": "string"},
{"name": "interval_length", "type": "int"},
{"name": "days", "type": {"type":"record","name":"days_", "fields": [
{"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
{"name": "quality", "type": "string"},
{"name": "values", "type": {"type":"record", "name":"values_","fields": [
{"name": "interval", "type": "int"},
{"name": "value", "type": "float"}
]}}
]}}
]
}
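A small sketch of the structural difference between this schema and the one in the question (stdlib only, field lists abbreviated): here `days` is declared directly as a record, with no enclosing array, which is why the loaded table shown below reports `days` as a required record rather than a repeated one:

```python
import json

# The "days" field as declared in this answer: the field's type IS the
# record itself, with no array wrapper around it.
answer_days = json.loads("""
{"name": "days", "type": {"type": "record", "name": "days_", "fields": [
    {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
    {"name": "quality", "type": "string"}
]}}
""")

# In the question's schema, by contrast, the field's type was an array node
# wrapping the record (abbreviated here for comparison).
question_days_type = {"type": "array",
                      "items": {"name": "days_record", "type": "record"}}

# Without the array, BigQuery maps the field to a single required RECORD
# column; with it, the field becomes repeated (plus the extra nesting).
assert answer_days["type"]["type"] == "record"
assert question_days_type["type"] == "array"
```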
I used it to create an empty Avro file, and ran the following command:
bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro
Don't know about Parquet, but FYI logical types in Avro will be supported in BQ soon.

Thanks for the pointer about Avro logical types. I look forward to it. However, given Parquet's columnar nature, it seems a better fit for BigQuery. For the simple test data I generated (with a lot of repetition in the columns, which is typical of my data), Avro was about 25% of the JSON size, while Parquet was under 1% of the JSON size! I noticed another fact about this issue which suggests it is specific to Parquet, or perhaps to AvroParquetWriter. When I write an Avro file with this Avro schema and serialize it to JSON with avro-tools, I get the simple JSON I want (no "array" nodes). When I write a Parquet file with AvroParquetWriter and the same Avro schema, and serialize it back to JSON with parquet-tools, I get the extra "array" nodes. Perhaps there is an option in AvroParquetWriter that controls this. I will investigate.

If you were able to solve your problem, could you post an answer and accept it?

Victor, I want the days and values records to be repeated. Your example has them as nested records, but not repeated. The BigQuery schema seems to allow a field to be a repeated record, whereas in Avro a field can be an array or a record, but not both at once, which produces the extra level of nesting. Note also that loading from an Avro file is currently not possible for me, because BigQuery does not yet support logical types such as dates. (I will retry when that support becomes available.)

@JohnHurst There is a feature request for Avro timestamp, date and time support (). You can star it to track its progress.

Loading the empty Avro file produced a table with the following schema:
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels kmsKeyName
----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------
22 May 09:46:02 |- file_name: string (required) 0 0
|- file_created: integer (required)
|- id: string (required)
|- interval_length: integer (required)
+- days: record (required)
| |- interval_date: integer (required)
| |- quality: string (required)
| +- values: record (required)
| | |- interval: integer (required)
| | |- value: float (required)