Google Cloud Storage: write each line received via PubSub to its own file on Cloud Storage


google-cloud-storage google-cloud-dataflow


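I receive messages through PubSub. Each message should be stored as raw data in its own file in GCS, some processing should be performed on the data, and the result should then be saved to BigQuery, with the file name included in the data.

The data should be visible in BQ immediately after it is received.

Example: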

data published to pubsub : {a:1, b:2} 
data saved to GCS file UUID: A1F432 
data processing :  {a:1, b:2} -> 
                   {a:11, b: 22} -> 
                   {fileName: A1F432, data: {a:11, b: 22}} 
data in BQ : {fileName: A1F432, data: {a:11, b: 22}}

The idea is that the processed data is stored in BQ and linked to the raw data stored in GCS.
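
The linking step in the example above can be sketched in plain Java. This is a minimal illustration only, not code from the question: the class `RecordLinker` and its methods are hypothetical names, the file name is assumed to be a random UUID, and the processed data is assumed to already be a JSON string. The point it shows is that the file name has to be generated once per message; a name computed a single time in `main` when the pipeline is built would be the same for every message.

```java
import java.util.UUID;

// Hypothetical helper, not part of the Dataflow SDK.
final class RecordLinker {

    // Generates a fresh name for each message (assumption: a random UUID is acceptable).
    static String newFileName() {
        return UUID.randomUUID().toString();
    }

    // Wraps already-processed JSON text together with the name of the raw-data file.
    // The file name is inserted without escaping, which is safe for UUID strings only.
    static String link(String fileName, String processedJson) {
        return "{\"fileName\": \"" + fileName + "\", \"data\": " + processedJson + "}";
    }
}
```

Calling `RecordLinker.link(RecordLinker.newFileName(), processedJson)` inside the per-message processing step would yield one record per message, each carrying the name of its own raw file.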

Here is my code:

public class BotPipline {

public static void main(String[] args) {

    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);
    options.setProject(MY_PROJECT);
    options.setStagingLocation(MY_STAGING_LOCATION);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);

    PCollection<String> input = pipeline.apply(PubsubIO.Read.subscription(MY_SUBSCRIBTION));

    String uuid = ...;
    input.apply(TextIO.Write.to(MY_STORAGE_LOCATION + uuid));

    input
        .apply(ParDo.of(new DoFn<String,String>(){..}).named("updateJsonAndInsertUUID"))
        .apply(convertToTableRow(...).named("convertJsonStringToTableRow"))
        .apply(BigQueryIO.Write.to(MY_BQ_TABLE).withSchema(tableSchema));
    pipeline.run();
}
}


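My code does not run, because writing an unbounded collection with TextIO.Write is not supported. After some research, I found a few options for solving this problem:

1. Create a custom sink in Dataflow

2. Implement the write to GCS as my own DoFn

3. Access the window of the data using the optional BoundedWindow

I don't know how to get started. Can anyone provide code for one of the solutions above, or suggest a different solution that fits my case? (Please provide code.)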

Your best option is #2: a simple DoFn that creates a file based on your data. Something like this:

class CreateFileFn extends DoFn<String, Void> {
  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    String filename = ...generate filename from element...;
    try (WritableByteChannel channel = FileSystems.create(
            FileSystems.matchNewResource(filename, false),
            "text/plain")) {
      OutputStream out = Channels.newOutputStream(channel);
      ...write the element to out...
    }
  }
}
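
To make the elided steps concrete, here is a minimal sketch that runs without the Beam SDK. It is not the answer's code: a local `java.nio` channel stands in for the channel returned by `FileSystems.create(...)`, and the class `PerElementWriter`, the `baseDir` parameter, and the UUID naming are all assumptions made for illustration. In the DoFn above, the write location would simply be part of the `filename` string passed to `FileSystems.matchNewResource`, for example a `gs://` bucket prefix followed by the generated name.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical local stand-in for the body of CreateFileFn.process().
final class PerElementWriter {

    // Writes one element to its own new file under baseDir and returns the file name.
    static String writeElement(Path baseDir, String element) throws IOException {
        // Assumption: a random UUID is used as the per-element file name.
        String fileName = UUID.randomUUID().toString();
        Path target = baseDir.resolve(fileName);
        // In Beam, this channel would come from FileSystems.create(resourceId, mimeType).
        try (WritableByteChannel channel = Files.newByteChannel(
                target, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            OutputStream out = Channels.newOutputStream(channel);
            out.write(element.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
        return fileName;
    }
}
```

Returning the file name is a deliberate choice in this sketch: inside a DoFn, the same value could be emitted to the next step so that the record written to BigQuery carries the name of its raw file.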


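Where do I add the write location?

out.write(c, "gs://my-bucket/outputs/" + fileName) does not give me this option. Can you edit your answer?

Sorry, I don't understand the question. Are you asking how to implement ...generate filename from element...?

No, my question is how to upload streaming data to storage, with each message published from pubsub going to its own file.

Hmm. To upload each element of a PCollection to its own file, you need to apply ParDo.of(new CreateFileFn()) as described in my answer. Could you clarify what difficulty you are having with that?

My question is whether this should be done with a regular HTTP storage request, or in some other way that is specific to Dataflow. If so, please just provide the code. I have not found any examples.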