Java: Handling an empty PCollection with BigQuery in Apache Beam


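With the code below, I get the following error when trying to write to BigQuery.

I am using Apache Beam 2.0.0.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException

If I change text.startsWith("X") to text.startsWith("D"), everything works fine (i.e. something is output).

Is there a way to catch or monitor an empty PCollection?

Judging by the stack trace, the error actually occurs inside BigQueryIO. The file in my bucket has 0 bytes, which may be what causes BigQueryIO to fail.

My use case is that I use a side output for dead letters. I hit this error when my job produced no dead-letter output, so it would be useful to handle that case robustly.

The job should be able to run in either batch or streaming mode. My best guess is to write any output to GCS/TextIO in batch mode and to BigQuery when streaming, if that sounds reasonable.

Any help is appreciated.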
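The dead-letter routing and the batch/streaming sink choice described above can be modeled in plain Java. This is only a sketch of the decision logic under stated assumptions: `DeadLetterRouting`, `Routed`, `Sink`, `route` and `chooseSink` are hypothetical names, none of them are Beam APIs, and "blank record" is an assumed stand-in for whatever makes a record a dead letter. In Beam the contents of a PCollection are only known at run time, so this shows the intended behavior rather than how to wire it into a pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Plain-Java model of the routing logic; none of these names are Beam APIs. */
public class DeadLetterRouting {

    /** Hypothetical sink choices mirroring the GCS/TextIO vs. BigQuery idea. */
    enum Sink { GCS_TEXT, BIGQUERY }

    /** Holds the main output and the dead-letter output of one pass. */
    static final class Routed {
        final List<String> main = new ArrayList<>();
        final List<String> deadLetters = new ArrayList<>();
    }

    /** Assumption: a record is a dead letter when it is null or blank. */
    static Routed route(List<String> records) {
        Routed out = new Routed();
        for (String record : records) {
            if (record == null || record.trim().isEmpty()) {
                out.deadLetters.add(record);
            } else {
                out.main.add(record);
            }
        }
        return out;
    }

    /** Picks a sink for a branch, or returns empty when there is nothing to write. */
    static Optional<Sink> chooseSink(List<String> branch, boolean streaming) {
        if (branch.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(streaming ? Sink.BIGQUERY : Sink.GCS_TEXT);
    }

    public static void main(String[] args) {
        Routed routed = route(Arrays.asList("Dog", "Cat", "Goldfish"));
        // All three records are valid, so the dead-letter branch is empty
        // and no sink is selected for it.
        System.out.println("main sink: " + chooseSink(routed.main, false));
        System.out.println("dead-letter sink: " + chooseSink(routed.deadLetters, false));
    }
}
```

Returning an `Optional` makes the "nothing to write" case explicit, so the caller has to decide what to do with an empty branch instead of handing it to a sink.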

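import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmptyPCollection {

    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        options.setTempLocation("gs://<your-bucket-here>/temp");
        Pipeline pipeline = Pipeline.create(options);
        String schema = "{\"fields\": [{\"name\": \"pet\", \"type\": \"string\", \"mode\": \"required\"}]}";
        String table = "<your-dataset>.<your-table>";
        List<String> pets = Arrays.asList("Dog", "Cat", "Goldfish");
        PCollection<String> inputText = pipeline.apply(Create.of(pets)).setCoder(StringUtf8Coder.of());

        // Only matching elements are emitted; with "X" nothing matches, so rows is empty.
        PCollection<TableRow> rows = inputText.apply(ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String text = c.element();
                if (text.startsWith("X")) {  // change to (D)og and works fine
                    TableRow row = new TableRow();
                    row.set("pet", text);
                    c.output(row);
                }
            }
        }));

        // Writing the empty PCollection is what triggers the NullPointerException.
        rows.apply(BigQueryIO.writeTableRows().to(table).withJsonSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

        pipeline.run().waitUntilFinish();
    }
}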


[direct-runner-worker] INFO org.apache.beam.sdk.io.gcp.bigquery.TableRowWriter - Opening TableRowWriter to gs://<your-bucket>/temp/BigQueryWriteTemp/05c7a7c0786a4656abad97f11ef23d8e/2675e1c7-f4d7-4f78-a85f-a38095b57e6b.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
    at EmptyPCollection.main(EmptyPCollection.java:54)
Caused by: java.lang.NullPointerException
    at org.apache.beam.sdk.io.gcp.bigquery.WriteTables.processElement(WriteTables.java:97)

This looks like a bug in the BigQuery sink implementation in Apache Beam. The Apache Beam Jira is the appropriate place to report it.

I have filed an issue to track this.


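This looks like the same bug I ran into a while ago. It might be worth mentioning it here as well.

It certainly looks the same. Thanks a lot, at least I'm not going crazy! I'll look into raising an issue and drawing more attention to this, since in the long run it could trip up jobs.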
