BigQuery: splitting a table based on a column
Tags: google-bigquery, gcloud

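Short question: I want to split a BQ table into multiple smaller tables based on the distinct values of a column. So if the column `country` has 10 distinct values, the table should be split into 10 separate tables, each holding the rows for its own `country`. Ideally this would be done from within a BQ query (using `INSERT`, `MERGE`, etc.).

What I am doing right now is exporting the data to gstorage -> local storage -> splitting it locally, and then pushing the pieces back into tables (a very time-consuming process).

Thanks.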

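If the data all shares the same schema, just keep it in one table and use the clustering feature; the `CREATE TABLE ... CLUSTER BY country` statement further down the page shows the syntax.

That feature is still in beta, though.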

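You can use Dataflow for this. The example pipeline listed at the end of this page queries a BigQuery table, splits the rows based on a column, and then writes them to different Pub/Sub topics (these could just as well be different BigQuery tables).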

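Btw, I tried that, but my data is a bit too large for this approach; BigQuery maxes out and returns an error.

I think you should consider partitioning the table in BigQuery (assuming it supports this).

It only supports partitioning by date.

A couple of questions: how large can the rows get, and how many distinct values do you expect?

On average, I think rows in my table can reach ~50KB (most are smaller than that), and there are millions of rows. Each row has 20 columns, and `country` is one of them.

Why not? Look, I get different results. For example, I ran this against a public dataset:

create table `temp.sample_clustered_table` partition by date(pickup_datetime) cluster by rate_code, payment_type options (require_partition_filter=true) as select * from `nyc-tlc.green.trips_2015`

It creates a clustered table. However, when I query temp.sample_clustered_table, only the partition filter reduces the cost; filtering by rate_code does not reduce it any further. Not sure why; you can try it yourself.

Maybe you are not using a prefix filter? If you filter rate_code with something like '%xyz', it will not work. It has to be a prefix or an exact match.

No, I am not doing a wildcard search. I am running

select * from temp.sample_clustered_table where date(pickup_datetime) = '2015-03-01' and rate_code = 5

but whatever value I switch rate_code to, it does not seem to reduce the cost. Note that switching rate_code in the query should change the cost, because some rate_code values appear only rarely in the dataset.

It does not change the estimated cost, only the actual cost. BigQuery cannot take clustering into account when estimating cost. Did you actually run the queries to compare the cost?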

Code from the first answer (clustered table):

#standardSQL
 CREATE TABLE mydataset.myclusteredtable
 PARTITION BY dateCol
 CLUSTER BY country
 OPTIONS (
   description="a table clustered by country"
 ) AS (
   SELECT ....
 )
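
If separate tables really are required instead of one clustered table, the split asked for in the question can also be phrased as one `CREATE TABLE ... AS SELECT` statement per distinct value. The following is only a minimal sketch of generating those statements; the names `mydataset.source_table` and `mydataset.by_country_` and the sample country values are hypothetical, and the distinct values are assumed to have been fetched beforehand (for example with a `SELECT DISTINCT country` query). It only builds SQL strings and does not call BigQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SplitByCountry {

    // Turns a column value into a suffix that is safe inside a table name:
    // every character other than a-z, 0-9 or '_' is replaced by '_'.
    static String tableSuffix(String value) {
        return value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9_]", "_");
    }

    // Wraps a value in single quotes, escaping backslashes and single quotes.
    static String sqlLiteral(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // Builds one CREATE TABLE ... AS SELECT statement per distinct value.
    static List<String> buildStatements(String sourceTable, String targetPrefix,
                                        String column, List<String> distinctValues) {
        List<String> statements = new ArrayList<>();
        for (String value : distinctValues) {
            statements.add("CREATE TABLE " + targetPrefix + tableSuffix(value)
                    + " AS SELECT * FROM " + sourceTable
                    + " WHERE " + column + " = " + sqlLiteral(value));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical table names and values, for illustration only.
        List<String> countries = Arrays.asList("US", "DE", "Cote d'Ivoire");
        for (String sql : buildStatements("mydataset.source_table",
                "mydataset.by_country_", "country", countries)) {
            System.out.println(sql);
        }
    }
}
```

Each generated statement is a separate query over the source table, so the cost grows with the number of distinct values. Two values that sanitize to the same suffix would also collide, which a real script would need to handle.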

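Code from the second answer (Dataflow pipeline):

// Written against the Dataflow SDK 1.x API (ParDo.named, c.sideOutput);
// newer Apache Beam SDKs use different method names for these calls.
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionTuple;
import com.google.cloud.dataflow.sdk.values.TupleTag;
import com.google.cloud.dataflow.sdk.values.TupleTagList;

// The statements below belong inside a main(String[] args) method.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read every row of the public weather_stations sample table.
PCollection<TableRow> weatherData = p.apply(
        BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

// One tag per output: the main output plus two side outputs.
final TupleTag<String> readings2010 = new TupleTag<String>() {
};
final TupleTag<String> readings2000plus = new TupleTag<String>() {
};
final TupleTag<String> readingsOld = new TupleTag<String>() {
};

// Route each row to an output based on the cell at index 2, which the branches treat as a year.
PCollectionTuple collectionTuple = weatherData.apply(ParDo.named("tablerow2string")
        .withOutputTags(readings2010, TupleTagList.of(readings2000plus).and(readingsOld))
        .of(new DoFn<TableRow, String>() {
            @Override
            public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {

                if (c.element().getF().get(2).getV().equals("2010")) {
                    // Main output: readings from exactly 2010.
                    c.output(c.element().toString());
                } else if (Integer.parseInt(c.element().getF().get(2).getV().toString()) > 2000) {
                    // Readings after 2000, other than 2010.
                    c.sideOutput(readings2000plus, c.element().toString());
                } else {
                    // Readings from 2000 or earlier.
                    c.sideOutput(readingsOld, c.element().toString());
                }

            }
        }));

// Write each output to its own Pub/Sub topic.
collectionTuple.get(readings2010)
        .apply(PubsubIO.Write.named("WriteToPubsub1").topic("projects/fh-dataflow/topics/bq2pubsub-topic1"));
collectionTuple.get(readings2000plus)
        .apply(PubsubIO.Write.named("WriteToPubsub2").topic("projects/fh-dataflow/topics/bq2pubsub-topic2"));
collectionTuple.get(readingsOld)
        .apply(PubsubIO.Write.named("WriteToPubsub3").topic("projects/fh-dataflow/topics/bq2pubsub-topic3"));

p.run();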
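
The branching inside the `DoFn` can be checked without a Dataflow dependency. The sketch below pulls the same three-way decision out into plain Java; the class name `YearRouter` and its methods are hypothetical helpers, and the bucket names simply reuse the tag names from the pipeline. Because the comparison is strict, the year 2000 itself lands in `readingsOld`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class YearRouter {

    // Mirrors the three branches of the DoFn: exactly 2010, after 2000, everything else.
    static String bucketFor(String year) {
        if (year.equals("2010")) {
            return "readings2010";
        } else if (Integer.parseInt(year) > 2000) {
            return "readings2000plus";
        } else {
            return "readingsOld";
        }
    }

    // Groups the input values by bucket, keeping buckets in first-seen order.
    static Map<String, List<String>> route(List<String> years) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String year : years) {
            buckets.computeIfAbsent(bucketFor(year), k -> new ArrayList<>()).add(year);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // prints {readings2010=[2010], readings2000plus=[2005, 2011], readingsOld=[1999, 2000]}
        System.out.println(route(Arrays.asList("2010", "2005", "1999", "2000", "2011")));
    }
}
```

Splitting on `country` instead of a year would follow the same pattern, with one output per distinct value rather than three fixed buckets.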