Why is writing to BigQuery using Dataflow so slow?

google-cloud-platform google-bigquery google-cloud-dataflow

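I can stream inserts directly into BigQuery at roughly 10,000 inserts per second, but when I try to insert using Dataflow, the "ToBQRow" step (shown below) is extremely slow: only 50 rows per 10 minutes, and that is with 4 workers. Any idea why? Here is the relevant code: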

PCollection<Status> statuses = p
    .apply("GetTweets", PubsubIO.readStrings().fromTopic(topic))
    .apply("ExtractData", ParDo.of(new DoFn<String, Status>() {
        @ProcessElement
        public void processElement(DoFn<String, Status>.ProcessContext c) throws Exception {
            String rowJson = c.element();
            try {
                TweetsWriter.LOGGER.debug("ROWJSON = " + rowJson);
                Status status = TwitterObjectFactory.createStatus(rowJson);
                if (status == null) {
                    TweetsWriter.LOGGER.error("Status is null");
                } else {
                    TweetsWriter.LOGGER.debug("Status value: " + status.getText());
                }
                c.output(status);
                TweetsWriter.LOGGER.debug("Status: " + status.getId());
            } catch (Exception var4) {
                TweetsWriter.LOGGER.error("Status creation from JSON failed: " + var4.getMessage());
            }
        }
    }));

statuses
    .apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            TableRow row = new TableRow();
            Status status = c.element();
            row.set("Id", status.getId());
            row.set("Text", status.getText());
            row.set("RetweetCount", status.getRetweetCount());
            row.set("FavoriteCount", status.getFavoriteCount());
            row.set("Language", status.getLang());
            row.set("ReceivedAt", (Object)null);
            row.set("UserId", status.getUser().getId());
            row.set("CountryCode", status.getPlace().getCountryCode());
            row.set("Country", status.getPlace().getCountry());
            c.output(row);
        }
    }))
    .apply("WriteTableRows", BigQueryIO.writeTableRows()
        .to(tweetsTable)
        .withSchema(schema)
        .withMethod(Method.STREAMING_INSERTS)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));

p.run();

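It turns out that BigQuery under Dataflow is not slow. The problem was that status.getPlace().getCountryCode() was returning NULL, so it was throwing a NullPointerException that I could not see anywhere in the logs! Clearly, Dataflow logging needs to improve. It runs really well now: as soon as a message appears in the topic, it is written to BigQuery almost immediately.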
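
One way to guard against this kind of failure is to read the nested fields null-safely before putting them into the row. The sketch below is plain Java with hypothetical stand-in classes (Place, Status) that mimic only the getters used in the question, and a Map standing in for TableRow; it is not the Twitter4J or Beam API itself. It also assumes the corresponding BigQuery columns are NULLABLE.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Minimal sketch: null-safe extraction of nested fields before building a row. */
final class NullSafeRow {

    /** Hypothetical stand-in for the place object; only the getters used in the question. */
    static final class Place {
        private final String countryCode;
        private final String country;

        Place(String countryCode, String country) {
            this.countryCode = countryCode;
            this.country = country;
        }

        String getCountryCode() { return countryCode; }
        String getCountry() { return country; }
    }

    /** Hypothetical stand-in for the status object. */
    static final class Status {
        private final long id;
        private final Place place; // may be null when no place is attached

        Status(long id, Place place) {
            this.id = id;
            this.place = place;
        }

        long getId() { return id; }
        Place getPlace() { return place; }
    }

    /** Builds a row as a plain map (standing in for TableRow) without dereferencing null. */
    static Map<String, Object> toRow(Status status) {
        // Wrap the possibly-null place once; map() skips the getter when it is absent.
        Optional<Place> place = Optional.ofNullable(status.getPlace());
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("Id", status.getId());
        row.put("CountryCode", place.map(Place::getCountryCode).orElse(null));
        row.put("Country", place.map(Place::getCountry).orElse(null));
        return row;
    }
}
```

With this shape, a status that has no place produces a row whose CountryCode and Country are null instead of throwing inside the step.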
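
Are you doing any computationally heavy operations with your Status? Maybe you have hit the Beam graph fusion optimization, where several of your transforms are fused into a single one, which can create a bottleneck. Try doing a Reshuffle before ToBQRow.

I have updated the code above. As you can see, I am not doing any computationally heavy operations: I simply read the message from the PubSub topic, extract the relevant information, create the TableRow object and write it. "ToBQRow" seems to be the real culprit: Input collections -> Elements added -> 13829. Output collections -> Elements added -> 249.

I don't see any kind of windowing; this could be an issue.

I don't understand why I should use windowing! I am not aggregating data. I tried it anyway, but it did not help (not sure if this is the right usage!): .apply("GetTweets", PubsubIO.readStrings().fromTopic(topic)).apply("TimeWindow", Window.into(SlidingWindows.of(AveraginInterval).every(AveraginInterval)))

I have read the documentation for Status without much success in understanding what is happening underneath, but I think the issue is with the TableRow object. Can you verify that the TableRow is not being populated with null data? Second, can you verify that the schema matches the TableRow? If it does not, that would explain why only some rows are mapped, since the schema would match only certain rows (i.e. when the additional value is null). If you can confirm that these are not the problem, I will keep digging.

I am also having the slow-write issue. How did you find the error if it was not in the logs?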
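
On that last question, the thread does not say how the error was eventually found. One general way to make such errors visible is to catch exceptions per element inside the conversion step and record the failing input together with the exception, instead of letting it escape the step. The sketch below shows the idea in plain Java with no Beam dependency; SafeConverter, Result and convertAll are hypothetical names. In a Beam pipeline the same pattern would usually sit inside the DoFn's @ProcessElement method, with failures logged or emitted to a separate output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Minimal sketch: convert elements one by one and keep failures instead of losing them. */
final class SafeConverter {

    /** Holds successfully converted outputs and a description of each failure. */
    static final class Result<O> {
        final List<O> outputs = new ArrayList<>();
        final List<String> failures = new ArrayList<>();
    }

    static <I, O> Result<O> convertAll(List<I> inputs, Function<I, O> converter) {
        Result<O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.outputs.add(converter.apply(input));
            } catch (RuntimeException e) {
                // Record the offending input and the exception so the failure is visible.
                result.failures.add(input + " -> " + e);
            }
        }
        return result;
    }
}
```

Comparing the number of outputs with the number of inputs, as was done above with the "Elements added" counters, then points directly at the recorded failures.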