Google Dataflow template job does not scale when writing records to Google Datastore


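I have a small Dataflow job that is triggered from a Cloud Function using a Dataflow template. The job reads from a table in BigQuery, converts each resulting TableRow into a key-value pair, and writes that key-value pair to Datastore.

This is what my code looks like: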

PCollection<TableRow> bigqueryResult = p.apply("BigQueryRead",
                BigQueryIO.readTableRows().withTemplateCompatibility()
                        .fromQuery(options.getQuery()).usingStandardSql()
                        .withoutValidation());

bigqueryResult.apply("WriteFromBigqueryToDatastore", ParDo.of(new DoFn<TableRow, String>() {
            @ProcessElement
            public void processElement(ProcessContext pc) {
                TableRow row = pc.element();

                // Note: a Datastore client is obtained for every element.
                Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
                KeyFactory keyFactoryCounts = datastore.newKeyFactory().setNamespace("MyNamespace")
                        .setKind("MyKind");

                Key key = keyFactoryCounts.newKey("Key");
                Entity.Builder builder = Entity.newBuilder(key);
                // The value is a string, so it is wrapped in a StringValue.
                builder.set("Key", StringValue.newBuilder("Value").setExcludeFromIndexes(true).build());

                Entity entity = builder.build();
                // One blocking put call per element.
                datastore.put(entity);
            }
        }));

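This pipeline runs fine when the number of records I process is between 1 and 100. However, when I put more load on it, around 10,000 records, the pipeline does not scale, even though autoscaling is set to throughput-based and the maximum number of workers is set as high as 50 with the n1-standard-1 machine type. The job stays at 3 or 4 elements per second, processed by one or two workers. This hurts the performance of my system.

Any suggestions on how to improve the performance are very welcome. Thanks in advance.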

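At least with Python's ndb client library, you can write up to 500 entities in a single .put_multi() Datastore call, which is much faster than calling .put() for one entity at a time, since each call blocks on the underlying RPC.

I'm not a Java user, but a similar technique appears to be available there as well. From the documentation:

You can use the batch operations if you want to operate on multiple entities in a single Cloud Datastore call.

Here is an example of a batch call: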

Entity employee1 = new Entity("Employee");
Entity employee2 = new Entity("Employee");
Entity employee3 = new Entity("Employee");
// ...

List<Entity> employees = Arrays.asList(employee1, employee2, employee3);
datastore.put(employees);
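
To make the batching idea concrete, here is a minimal sketch of a buffer that collects items and hands them over in groups of at most 500, the per-call limit mentioned above. It uses only the Java standard library. `BatchBuffer` and its `sink` callback are hypothetical names, and the callback is only a stand-in for a batched call such as `datastore.put(employees)`, not part of any Datastore API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects items and passes them to a sink in batches of at most maxBatchSize. */
class BatchBuffer<T> {
    private final int maxBatchSize;
    // Hypothetical stand-in for a batched write such as datastore.put(List).
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Adds one item and flushes automatically once the batch is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Sends any buffered items to the sink; does nothing if the buffer is empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so the sink can keep the list after the buffer is cleared.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With a batch size of 500, adding 1,200 items and then calling `flush()` results in three calls to the sink instead of 1,200 individual writes. The final `flush()` matters: without it, the last partial batch would never be written.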

I found a solution by using DatastoreIO instead of the Datastore client. Below is the code snippet I used:

    PCollection<TableRow> row = p.apply("BigQueryRead",
                BigQueryIO.readTableRows().withTemplateCompatibility()
                        .fromQuery(options.getQueryForSegmentedUsers()).usingStandardSql()
                        .withoutValidation());          

    PCollection<com.google.datastore.v1.Entity> userEntity = row.apply("ConvertTablerowToEntity", ParDo.of(new DoFn<TableRow, com.google.datastore.v1.Entity>() {

        @SuppressWarnings("deprecation")
        @ProcessElement
        public void processElement(ProcessContext pc) {
            final String namespace = "MyNamespace";
            final String kind = "MyKind";

            com.google.datastore.v1.Key.Builder keyBuilder = DatastoreHelper.makeKey(kind, "root");
            if (namespace != null) {
              keyBuilder.getPartitionIdBuilder().setNamespaceId(namespace);
            }
            final com.google.datastore.v1.Key ancestorKey = keyBuilder.build();

            TableRow row = pc.element();
            String entityProperty = "sample";

            String key = "key";

            com.google.datastore.v1.Entity.Builder entityBuilder = com.google.datastore.v1.Entity.newBuilder();
            com.google.datastore.v1.Key.Builder keyBuilder1 = DatastoreHelper.makeKey(ancestorKey, kind, key);
            if (namespace != null) {
                keyBuilder1.getPartitionIdBuilder().setNamespaceId(namespace);
            }

            entityBuilder.setKey(keyBuilder1.build());
            entityBuilder.getMutableProperties().put(entityProperty, DatastoreHelper.makeValue("sampleValue").build());
            pc.output(entityBuilder.build());
        }

    }));

    userEntity.apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId(options.getProject()));

This solution was able to scale from 3 elements per second (with 1 worker) to 1,500 elements per second (with 20 workers).

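Hi Dan, thanks for the suggestion. However, what I'm looking for is a way to scale the pipeline out so that it uses more resources. Grouping entities into a single datastore.put operation does seem like a possible solution.

Fundamentally, the scalability limit comes from the serialized processing of the BQ results. If you can split that work across parallel tasks, you will scale better. You could try enqueueing a push task for each result/row, or for each cluster of up to 500 results/rows, and have those tasks write the chunks to Datastore. But I'm not sure that creating a task and passing it the necessary information is much faster than just passing the data to Datastore.

It looks like the problem lies more with the Datastore usage than with BigQuery, although what you said about the scalability limit coming from serialized BQ result processing makes sense. Using Dataflow's DatastoreIO worked better for me than using the Datastore client. I don't know why the client could not scale.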
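
The suggestion above about splitting rows into clusters of up to 500 and writing them from parallel tasks can be sketched as follows. This is only an illustration using the Java standard library: a local thread pool stands in for the push tasks mentioned in the comment, `ParallelChunkWriter` is a hypothetical name, and the `writer` callback is a placeholder for whatever actually writes a chunk to Datastore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelChunkWriter {

    /** Splits rows into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunk(List<T> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    /**
     * Submits one task per chunk to a fixed-size pool, waits for all of them,
     * and returns the number of chunks written. The writer is a hypothetical
     * stand-in for a batched Datastore write and must be thread-safe.
     */
    static <T> int writeInParallel(List<T> rows, int chunkSize, int threads,
                                   Consumer<List<T>> writer) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> c : chunk(rows, chunkSize)) {
                futures.add(pool.submit(() -> writer.accept(c)));
            }
            for (Future<?> f : futures) {
                f.get(); // propagate any failure from the task
            }
            return futures.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while writing chunks", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("A chunk write failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For 1,200 rows and a chunk size of 500 this submits three tasks. As the comment itself notes, whether the overhead of creating tasks pays off is unclear; inside a Dataflow pipeline the runner already parallelizes work across workers, which is one reason a built-in sink such as DatastoreIO may be the simpler route.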