Google Cloud Dataflow: problem creating a custom template
I am trying to create a template for a Cloud Dataflow job that reads JSON files from Cloud Storage and writes them to BigQuery. I am passing 2 runtime parameters: 1. the input file at a GCS location, and 2. the BigQuery dataset and table ID. JsonTextToBqTemplate code:
public class JsonTextToBqTemplate {
    private static final Logger logger =
        LoggerFactory.getLogger(TextToBQTemplate.class);
    private static Gson gson = new GsonBuilder().create();

    public static void main(String[] args) throws Exception {
        JsonToBQTemplateOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(JsonToBQTemplateOptions.class);
        String jobName = options.getJobName();
        try {
            logger.info("PIPELINE-INFO: jobName={} message={} ",
                jobName, "starting pipeline creation");
            Pipeline pipeline = Pipeline.create(options);
            pipeline.apply("ReadLines", TextIO.read().from(options.getInputFile()))
                .apply("Converting to TableRows", ParDo.of(new DoFn<String, TableRow>() {
                    private static final long serialVersionUID = 0;

                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        String json = c.element();
                        TableRow tableRow = gson.fromJson(json, TableRow.class);
                        c.output(tableRow);
                    }
                }))
                .apply(BigQueryIO.writeTableRows().to(options.getTableSpec())
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
            logger.info("PIPELINE-INFO: jobName={} message={} ", jobName, "pipeline started");
            State state = pipeline.run().waitUntilFinish();
            logger.info("PIPELINE-INFO: jobName={} message={} ", jobName, "pipeline status" + state);
        } catch (Exception exception) {
            throw exception;
        }
    }
}
Error:
Caused by: java.lang.IllegalStateException: Cannot estimate size of a FileBasedSource with inaccessible file pattern: {}. [RuntimeValueProvider{propertyName=inputFile, default=null, value=null}]
at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:518)
at org.apache.beam.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:199)
at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:207)
at org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
at org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
The Maven build succeeds when I pass values for inputFile and tableSpec, as shown below:
mvn -X compile exec:java \
-Dexec.mainClass=com.ihm.adp.pipeline.template.JsonTextToBqTemplate \
-Dexec.args="--project=xxxxxx-123456 \
--stagingLocation=gs://xxx-test/template/staging/jsontobq/ \
--tempLocation=gs://xxx-test/temp/ \
--templateLocation=gs://xxx-test/template/templates/jsontobq \
--inputFile=gs://xxx-test/input/bqtest.json \
--tableSpec=xxx_test.jsontobq_test \
--errorOutput=gs://xxx-test/template/output"
But it does not create any template in Cloud Dataflow.
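Is there a way to create the template during the Maven execution without validating these runtime parameters?
I think the problem here is that you are not specifying a runner. By default, this tries to use the DirectRunner. Try passing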
--runner=TemplatingDataflowPipelineRunner
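as part of your -Dexec.args. After that, you also no longer need to specify the ValueProvider template parameters such as inputFile.
More information is available in the Cloud Dataflow documentation on creating templates.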
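The question does not show the JsonToBQTemplateOptions interface. The following is a minimal sketch of how it might look, assuming the Apache Beam 2.x SDK is on the classpath and that both parameters are declared as ValueProvider<String> (the RuntimeValueProvider in the error message suggests this for inputFile). The description strings are illustrative, not taken from the original project.

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

// Hypothetical reconstruction of the options interface used in the question.
// Declaring the parameters as ValueProvider<String> defers their values
// until a job is launched from the staged template.
public interface JsonToBQTemplateOptions extends PipelineOptions {

    @Description("GCS path of the JSON input file (assumed wording)")
    ValueProvider<String> getInputFile();

    void setInputFile(ValueProvider<String> value);

    @Description("BigQuery table spec in dataset.table form (assumed wording)")
    ValueProvider<String> getTableSpec();

    void setTableSpec(ValueProvider<String> value);
}
```

Because the getters return ValueProvider, they can be handed directly to TextIO.read().from(...) and BigQueryIO.writeTableRows().to(...), as the pipeline code in the question already does. The sketch deliberately leaves the getters without a required-validation annotation, on the assumption that the values should be allowed to be absent while the template is being created.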
If you are using Dataflow SDK version 1.x, you need to specify the following arguments:
--runner=TemplatingDataflowPipelineRunner
--dataflowJobFile=gs://xxx-test/template/templates/jsontobq/
If you are using Dataflow SDK version 2.x (Apache Beam), you need to specify the following arguments:
--runner=DataflowRunner
--templateLocation=gs://xxx-test/template/templates/jsontobq/
It looks like you are using Dataflow SDK version 2.x but have not specified DataflowRunner for the runner parameter.
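The version-dependent flags described in this answer can be summarized in a small helper. This is only an illustrative, standard-library sketch: the class name TemplateArgs and the method templateFlags are hypothetical and are not part of the Dataflow or Beam SDKs. It simply returns the flag pair the answer lists for each SDK major version.

```java
import java.util.ArrayList;
import java.util.List;

public class TemplateArgs {

    // Returns the runner-related flags for staging a template,
    // chosen by the SDK major version (1 = Dataflow SDK 1.x, 2 = Apache Beam 2.x).
    public static List<String> templateFlags(int sdkMajorVersion, String templatePath) {
        List<String> flags = new ArrayList<>();
        if (sdkMajorVersion == 1) {
            flags.add("--runner=TemplatingDataflowPipelineRunner");
            flags.add("--dataflowJobFile=" + templatePath);
        } else if (sdkMajorVersion == 2) {
            flags.add("--runner=DataflowRunner");
            flags.add("--templateLocation=" + templatePath);
        } else {
            throw new IllegalArgumentException(
                "Unsupported SDK major version: " + sdkMajorVersion);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Bucket path copied from the question's example command.
        String path = "gs://xxx-test/template/templates/jsontobq";
        // Join the flags into a string suitable for pasting into -Dexec.args.
        System.out.println(String.join(" ", templateFlags(2, path)));
    }
}
```

For the 2.x case, the printed flags would be added to -Dexec.args alongside --project, --stagingLocation and --tempLocation from the question's command.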
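Thanks Andrew! The template was created successfully and works fine. However, my Maven build still fails with the following error: Caused by: java.lang.NullPointerException at org.apache.beam.runners.dataflow.DataflowPipelineJob.getJobWithRetries(DataflowPipelineJob.java:489). It seems to be trying to run the pipeline job but fails because of a null reference.
Hi @prasad, how did you run the create command? When I execute the mvn compile command from the project's main folder, I get a NoClassDefFoundError.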