Java code outside the pipeline won't run on Dataflow

java google-cloud-dataflow apache-beam

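It looks like any code outside the pipeline does not run on Dataflow. In the example below, I get a NullPointerException for TableSchema in the TableRowConverterFn.processElement method. What is the correct way to do this with Apache Beam/Dataflow?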

    private static TableSchema TableSchema;

    public static void main(String[] args) {

        try {
            TableSchema = TableSchemaReader.read(TableSchemaResource);
        } catch (IOException e) {
            log.error("Table schema can not be read from {}. Process aborted.", TableSchemaResource);
            return;
        }

        DataflowDfpOptions options = PipelineOptionsFactory.fromArgs(args)
                //.withValidation()
                .as(DataflowDfpOptions.class);

        Pipeline pipeline = Pipeline.create(options);

        Stopwatch sw = Stopwatch.createStarted();
        log.info("DFP data transfer from GS to BQ has started.");

        pipeline.apply("ReadFromStorage", TextIO.read()
                .from("gs://my-test/stream/*.gz")
                .withCompression(Compression.GZIP))
                .apply("TransformToTableRow", ParDo.of(new TableRowConverterFn()))
                .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                        .to(options.getTableId())
                        .withMethod(STREAMING_INSERTS)
                        .withCreateDisposition(CREATE_NEVER)
                        .withWriteDisposition(WRITE_APPEND)
                        .withSchema(TableSchema)); //todo: use withJsonSchema(String json) method instead


        pipeline.run().waitUntilFinish();

        log.info("DFP data transfer from GS to BQ is finished in {} seconds.", sw.elapsed(TimeUnit.SECONDS));
    }

    /**
     * Creates a TableRow from a CSV line
     */
    private static class TableRowConverterFn extends DoFn<String, TableRow> {

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {

            String[] split = c.element().split(",");

            //Ignore the header line
            //Since this is going to be run in parallel, we can't guarantee that the first line passed to this method will be the header
            if (split[0].equals("Time")) {
                log.info("Skipped header");
                return;
            }

            TableRow row = new TableRow();
            for (int i = 0; i < split.length; i++) {

                //This throws a NullPointerException!
                TableFieldSchema col = TableSchema.getFields().get(i);

                //String is the most common type, putting it in the first if clause for a little bit optimization.
                if (col.getType().equals("STRING")) {
                    row.set(col.getName(), split[i]);
                } else if (col.getType().equals("INTEGER")) {
                    row.set(col.getName(), Long.valueOf(split[i]));
                } else if (col.getType().equals("BOOLEAN")) {
                    row.set(col.getName(), Boolean.valueOf(split[i]));
                } else if (col.getType().equals("FLOAT")) {
                    //BigQuery FLOAT is a 64-bit type, so parse as Double to avoid losing precision
                    row.set(col.getName(), Double.valueOf(split[i]));
                } else {
                    //Otherwise, simply try to write it as a String.
                    //todo: Consider other BQ data types.
                    row.set(col.getName(), split[i]);
                }
            }
            c.output(row);
        }
    }


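Although this code may work locally with the DirectRunner, it will indeed fail on the DataflowRunner. Here is why:

DoFns created outside the main function do not have access to the class's (even static) variables on the DataflowRunner. I believe this is due to how Dataflow stages and serializes DoFns when running in the cloud (not 100% sure, though). In particular, main only runs on the machine that launches the job, and Java serialization does not include static fields, so a static field set in main is still null when the DoFn runs on a worker.

Here is how you can get around this:

    private static class TableRowConverterFn extends DoFn<String, TableRow> {

        // Instance field (not static), so its value is serialized with the DoFn
        // and shipped to the workers. The field type must itself be serializable;
        // if TableSchema is not, pass the schema as a JSON string instead.
        private final TableSchema tableSchema;

        public TableRowConverterFn(TableSchema tableSchema) {
            this.tableSchema = tableSchema;
        }

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            // stuff
        }
    }

Then, in the main function, call:

    .apply("TransformToTableRow", ParDo.of(new TableRowConverterFn(tableSchema)));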
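
To see why the static field ends up null, here is a minimal sketch that uses only plain Java serialization and has no Beam dependency. ConverterFn is a hypothetical stand-in for a DoFn, the schema is reduced to a JSON string, and clearing the static by hand is an assumption that mimics a worker JVM in which main never ran. Dataflow's actual staging involves more than this, so treat it as an illustration rather than a description of the runner.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class StaticFieldDemo {

    // Set only by launcher-side code, like the static schema in the question.
    static String staticSchema;

    // Hypothetical stand-in for a DoFn: here it only needs to be Serializable.
    static class ConverterFn implements Serializable {
        private static final long serialVersionUID = 1L;

        // Instance field: its value is written out with the object.
        private final String schemaJson;

        ConverterFn(String schemaJson) {
            this.schemaJson = schemaJson;
        }

        String fromStatic() {
            return staticSchema;
        }

        String fromField() {
            return schemaJson;
        }
    }

    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    static ConverterFn deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ConverterFn) in.readObject();
        }
    }

    // Serializes on the "launcher", clears the static to mimic a fresh worker JVM,
    // then deserializes the object as a worker would.
    static ConverterFn shipToWorker(ConverterFn fn) throws IOException, ClassNotFoundException {
        byte[] data = serialize(fn);
        staticSchema = null; // assumption: the worker never ran main(), so the static is unset
        return deserialize(data);
    }

    public static void main(String[] args) throws Exception {
        staticSchema = "{\"fields\":[]}";
        ConverterFn onWorker = shipToWorker(new ConverterFn(staticSchema));
        System.out.println("static field on worker:   " + onWorker.fromStatic());
        System.out.println("instance field on worker: " + onWorker.fromField());
    }
}
```

Running main prints null for the static field and the JSON string for the instance field, which mirrors what the DoFn in the question sees once it leaves the launcher.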


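Tip: add the Java tag to get automatic code highlighting. As for the error you are getting, correct me if I'm wrong, but you never define the variable TableSchema in the TableRowConverterFn function, do you?

@hbartender Great tip! Thanks. TableSchema is a static field. Sorry, I forgot to add it to the source code. I have added it at the very top of the source.

I can't find TableSchemaReader and TableSchemaResource anywhere in Beam. Are these custom classes?

@ThehBarTender Yes. The schema is a file in the resources. TableSchemaReader simply reads it and creates a TableSchema object.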