Google Cloud Dataflow: TextIO.read().watchForNewFiles() prevents writing to BigQuery


I'm trying to create a pipeline that waits for new CSV files in a GCS folder, processes them, and writes the output to BigQuery.

I wrote the following code:

public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class));
    TableReference tableRef = new TableReference();
    tableRef.setProjectId(PROJECT_ID);
    tableRef.setDatasetId(DATASET_ID);
    tableRef.setTableId(TABLE_ID);
    //Pipeline p = Pipeline.create(PipelineOptionsFactory.as(Options.class));
    // Read files as they arrive in GCS
    p.apply("ReadFile", TextIO.read()
            .from("gs://mybucket/*.csv")
            .watchForNewFiles(
                // Check for new files every 30 seconds
                Duration.standardSeconds(30),
                // Never stop checking for new files
                Watch.Growth.never()
            )
        )
        .apply(ParDo.of(new DoFn<String, Segment>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String[] items = c.element().split(",");
                if (items[0].startsWith("25;", 1)) {
                    // Skip header (the header starts with _comment)
                    LOG.info("Skipped header");
                    return;
                }
                Segment segment = new Segment(items);
                c.output(segment);
            }
        }))
        .apply(ParDo.of(new FormatSegment()))
        .apply(BigQueryIO.writeTableRows()
            .to(tableRef)
            .withSchema(FormatSegment.getSchema())
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
    // Run the pipeline.
    p.run();
}
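As an aside on the header check in the DoFn above: it relies on `String.startsWith(prefix, offset)`, which tests for the prefix starting at a given index rather than at the beginning of the string. The following is a minimal sketch of that logic outside Beam, using only the standard library. The class name `HeaderSkipSketch`, the helper names, and the marker `"_"` are hypothetical; the marker assumes a header whose first field looks like `"_comment"` (a quote at index 0, the underscore at index 1), which is only a guess based on the code comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderSkipSketch {
    // Assumption: the header's first field looks like "_comment" (with quotes),
    // so the marker character sits at index 1 of that field.
    static final String HEADER_MARKER = "_";

    /** Returns true when the first CSV field has HEADER_MARKER at offset 1. */
    static boolean isHeader(String line) {
        String[] items = line.split(",");
        // split() drops trailing empty fields, so a line such as "," yields
        // an empty array; guard before touching items[0].
        return items.length > 0 && items[0].startsWith(HEADER_MARKER, 1);
    }

    /** Keeps only the non-header lines, each split into its fields. */
    static List<String[]> dataRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (!isHeader(line)) {
                rows.add(line.split(","));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("\"_comment\",\"name\"", "1,alpha", "2,beta");
        for (String[] row : dataRows(lines)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Note that `startsWith` with an offset past the end of the string returns false instead of throwing, so short or empty first fields are treated as data rows here.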

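If I remove the watchForNewFiles part, my code works fine (I see INFO logs about writing to the GCS temp location in parallel, and the output is eventually written to BigQuery).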

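But if I keep watchForNewFiles (the code above), I see only a single INFO log (about writing to the GCS temp location) and then execution gets stuck. There are no further logs, no errors, and no output in BigQuery.

Any ideas?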

When using watchForNewFiles(), we have to use the BigQueryIO.Write.Method.STREAMING_INSERTS method to write to BigQuery.

The code that works now looks like this:

.apply(BigQueryIO.writeTableRows()
        .to(tableRef)
        .withSchema(FormatSegment.getSchema())
        // Use streaming inserts, as described above
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));


When using the DataflowRunner, I get this error when trying to use it: java.lang.UnsupportedOperationException: DataflowRunner does not currently support splittable DoFn: org.apache.beam.sdk.transforms.Watch$WatchGrowthFn@4a1691ac

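With the direct runner, I can see it polling, but the rest of the pipeline doesn't seem to start, and there are no errors. This is when writing to Datastore and BigQuery.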

As a test, have you tried writing to an output sink other than BigQuery? That would confirm whether the problem is really related to BigQuery.