Java: best way to get the BigQuery temporary table created by a job in order to read large data faster


java google-cloud-platform google-bigquery


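I am trying to run a query against a table using the BigQuery Java client library. I create a job and then read its results with Job.getQueryResults().iterateAll().
This works, but for a large result such as 600k rows it takes about 80-120 seconds. I can see that BigQuery fetches the data in batches of 40-45k rows, and each batch takes about 5-7 seconds.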
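As a rough sanity check on those numbers, the batch counts and per-batch times can be multiplied out. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and it assumes every batch takes the same time and that batches are fetched one after another.

```java
/** Back-of-the-envelope estimate of paged-read time, using the figures from the question. */
final class PagedReadEstimate {

    /** Number of batches needed to fetch totalRows when each batch holds rowsPerBatch rows. */
    static long pages(long totalRows, long rowsPerBatch) {
        return (totalRows + rowsPerBatch - 1) / rowsPerBatch; // ceiling division
    }

    /** Total seconds if batches are fetched sequentially and each takes secondsPerBatch. */
    static double totalSeconds(long totalRows, long rowsPerBatch, double secondsPerBatch) {
        return pages(totalRows, rowsPerBatch) * secondsPerBatch;
    }

    public static void main(String[] args) {
        long rows = 600_000;
        // Best case: largest batches, fastest batch time. Worst case: the opposite.
        double best = totalSeconds(rows, 45_000, 5.0);
        double worst = totalSeconds(rows, 40_000, 7.0);
        System.out.printf("batches: %d to %d%n", pages(rows, 45_000), pages(rows, 40_000));
        System.out.printf("estimated time: %.0f s to %.0f s%n", best, worst);
    }
}
```

With these inputs the estimate is 14 to 15 batches and roughly 70 to 105 seconds, which is in line with the observed timings, so the total time appears to be dominated by the sequential batch fetches.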
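I want to get the results faster. I read online that if we can get the temporary table that BigQuery creates for the job and read the data from that table in Avro or some other format, it will be much faster, but I do not see such a method in the BigQuery API (version used: 1.124.7). Does anyone know how to do this in Java, or how to fetch the data faster when there is a large number of records? Any help is appreciated.
Code that reads the table directly (takes 20 seconds)
Code that reads the same table through a query (takes 90 seconds)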

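I tried a few approaches and found the best one among them. I am posting them here in case it helps someone.
1: Calling job.getQueryResults().iterateAll() on the job, or iterating directly over the table, takes the same time. If we do not specify a batch size, BigQuery fetches the data in batches of about 35-45k rows, so 600k rows (180 MB) take 70-100 seconds.
2: We can take the temporary table details from the job and use the table's extract job feature to write the result to GCS. This is faster and takes about 30-35 seconds. This approach does not download the data locally; for that we would have to call iterateAll() on the temporary table again, which takes the same time as approach 1.
Sample pseudocode: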

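```java
import com.google.cloud.RetryOption;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import org.threeten.bp.Duration;

// getBigQueryClient() and log are helpers of the surrounding class.
void extractQueryResultToGcs(String jobId) {
    try {
        BigQuery bigquery = getBigQueryClient();
        Job job = bigquery.getJob(JobId.of(jobId));
        long start = System.currentTimeMillis();

        // Wait for the query job to finish.
        Job completedJob = job.waitFor(
                RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                RetryOption.totalTimeout(Duration.ofMinutes(3)));
        if (completedJob == null || completedJob.getStatus().getError() != null) {
            log.info("query job has error");
            return;
        }
        log.info("query job done");
        String gcsUrl = "gs://bucketname/test";

        // Temporary (destination) table that BigQuery created for the query job.
        TableId destinationTableInfo =
                ((QueryJobConfiguration) completedJob.getConfiguration()).getDestinationTable();
        log.info("Time taken to get the destination table: {} ms", System.currentTimeMillis() - start);
        log.info("DestinationInfo: {}", destinationTableInfo);

        // Extract job: write the temporary table to GCS as tab-delimited CSV.
        // The format is passed as the string "CSV"; passing CsvOptions.toString()
        // as the format, as in the first draft, does not yield a valid format name.
        ExtractJobConfiguration extractConfig =
                ExtractJobConfiguration.newBuilder(destinationTableInfo, gcsUrl)
                        .setFormat("CSV")
                        .setFieldDelimiter("\t")
                        .build();
        Job extractJob = bigquery.create(JobInfo.of(extractConfig));
        Job completedExtract = extractJob.waitFor(
                RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                RetryOption.totalTimeout(Duration.ofMinutes(3)));
        if (completedExtract != null && completedExtract.getStatus().getError() == null) {
            log.info("extract job done");
        } else {
            log.info("extract job has error");
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
        log.error("Interrupted while waiting for the job", e);
    }
}
```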
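3: This is the best approach and the one I was looking for. It downloads/writes the result to a local file much faster, in about 20 seconds. It is a newer mechanism provided by BigQuery, described at the link given in the original answer.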
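To put the three timings above on a common scale, they can be converted to throughput. The sketch below only uses the figures quoted in this answer (180 MB; 70-100 s, 30-35 s and about 20 s); the class and method names are hypothetical. Keep in mind that approach 2 measures writing to GCS, not downloading to a local file, so it is not directly comparable to the other two.

```java
/** Converts the timings reported in the answer into throughput and relative speedup. */
final class ReadThroughput {

    /** Megabytes transferred per second. */
    static double mbPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    /** How many times faster a run is than the baseline run. */
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        double sizeMb = 180.0; // 600k rows, as reported in the answer
        String[] labels = {"1: iterateAll()", "2: extract job to GCS", "3: newer mechanism"};
        double[][] seconds = {{70, 100}, {30, 35}, {20, 20}}; // {fastest, slowest} per approach

        for (int i = 0; i < labels.length; i++) {
            double slow = mbPerSecond(sizeMb, seconds[i][1]);
            double fast = mbPerSecond(sizeMb, seconds[i][0]);
            System.out.printf("%-24s %.1f to %.1f MB/s%n", labels[i], slow, fast);
        }
        System.out.printf("approach 3 vs 1: %.1fx to %.1fx faster%n",
                speedup(70, 20), speedup(100, 20));
    }
}
```

On these numbers approach 3 works out to about 9 MB/s, which is 3.5 to 5 times faster than the paged iteration in approach 1.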


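Could you please share a link for "...I read online that if we can get the temporary table that BigQuery creates for the job and read the data from that table in Avro or some other format, it will be much faster"?
Do not confuse reading an already created table with executing a query that creates a table and then reading that table. The total aggregate time will be similar.
@JohnHanley Actually, I do see a difference in time between the two approaches. The code added to my description reads the same table (takes 20 seconds), and if I read it through a query it takes 90 seconds.
@KyryloBulat This link says so, and I found one more link like it, but both refer to the old API, not the new one.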
```java
// Iterate over the query job's result rows, which are fetched batch by batch.
Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
for (FieldValueList row : job.getQueryResults().iterateAll()) {
    System.out.println(row);
}
```
