Google Cloud Dataflow: how to use a custom coder in a PCollection<KV<String, B>>?


I'm trying to use a custom Coder so that I can do some transforms, but I'm having trouble getting the PCollection to use my custom coder, and I suspect (???) this is because it's wrapped in a KV. Specifically:

Pipeline p = Pipeline.create ...
p.getCoderRegistry().registerCoder(MyClass.class, MyClassCoder.class);

...

PCollection<String> input = ...
PCollection<KV<String, MyClass>> t = input.apply(new ToKVTransform());
I saw that the answer to another, somewhat related question () was to map everything to Strings and use those to pass things around in PCollections. Is that really the recommended approach?
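For reference, the "map everything to Strings" workaround usually means serializing the object and Base64-encoding the bytes into a String that the built-in string coder can handle. A minimal sketch, independent of the Dataflow SDK (class and method names here are illustrative, not part of any API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class StringMapping {
    // Serialize the object and Base64-encode the bytes into a plain String.
    static String toBase64String(Serializable value) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(value);
        }
        return Base64.getEncoder().encodeToString(baos.toByteArray());
    }

    // Reverse the process: Base64-decode, then deserialize.
    static Object fromBase64String(String encoded) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String roundTripped = (String) fromBase64String(toBase64String("hello"));
        System.out.println(roundTripped); // prints "hello"
    }
}
```

It works, but it pushes serialization concerns into every transform, which is exactly what a registered custom coder is supposed to avoid.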

(Note: the actual code is in Scala, but I'm pretty sure this isn't a Scala vs. Java issue, so I've translated it to Java here.)

Update, to include the Scala code and more background:

So, here is the actual exception itself (I should have included this from the start):

where com.example.schema.Schema is:

case class Schema(id: String, keyTypes: Map[String, Type])

class SchemaCoder extends com.google.cloud.dataflow.sdk.coders.CustomCoder[Schema] {
  import java.io._
  import scala.collection.JavaConverters._

  def decode(inputStream: InputStream, context: Context): Schema = {
    val ois = new ObjectInputStream(inputStream)
    val id: String = ois.readObject().asInstanceOf[String]
    val javaMap: java.util.Map[String, Type] = ois.readObject().asInstanceOf[java.util.Map[String, Type]]
    ois.close()

    Schema(id, javaMap.asScala.toMap)
  }

  def encode(schema: Schema, outputStream: OutputStream, context: Context): Unit = {
    // Write the raw serialized bytes so that decode, which reads the stream
    // directly with an ObjectInputStream, can reverse this exactly. (An earlier
    // version Base64-encoded the bytes here, which decode never undid.)
    val oos = new ObjectOutputStream(outputStream)
    oos.writeObject(schema.id)
    val javaMap: java.util.Map[String, Type] = schema.keyTypes.asJava
    oos.writeObject(javaMap)
    oos.flush()
  }
}
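The contract a coder must satisfy is that decode exactly reverses encode over the same stream (note the asymmetry if the bytes are Base64-encoded on write but read back raw). A plain-Java sketch of a symmetric round trip for a (String id, Map) pair like Schema's, with String standing in for Type and no Dataflow SDK involved:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

public class SymmetricCoder {
    // Write the id and the map as raw serialized objects, in a fixed order.
    static void encode(String id, Map<String, String> keyTypes, OutputStream out) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(out);
        oos.writeObject(id);
        oos.writeObject(new HashMap<>(keyTypes)); // HashMap is Serializable
        oos.flush();
    }

    // Read them back in the same order, from the same raw byte stream.
    @SuppressWarnings("unchecked")
    static Map.Entry<String, Map<String, String>> decode(InputStream in)
            throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(in);
        String id = (String) ois.readObject();
        Map<String, String> keyTypes = (Map<String, String>) ois.readObject();
        return Map.entry(id, keyTypes);
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Map<String, String> types = Map.of("age", "Int", "name", "String");
        encode("s1", types, baos);
        Map.Entry<String, Map<String, String>> back =
            decode(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(back.getKey().equals("s1") && back.getValue().equals(types)); // prints "true"
    }
}
```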
====

Edit 2: here is what ToKVTransform actually looks like:

class SchemaExtractorTransform extends PTransform[PCollection[String], PCollection[Schema]] {
  class InferSchemaFromStringWithKeyFn extends DoFn[String, KV[String, Schema]] {
    override def processElement(c: DoFn[String, KV[String, Schema]]#ProcessContext): Unit = {
      val line = c.element()
      // inferSchemaFromString is a helper defined elsewhere; the inferred
      // schema must actually be emitted for the downstream steps to see it.
      c.output(KV.of(line, inferSchemaFromString(line)))
    }
  }

  class GetFirstFn extends DoFn[KV[String, java.lang.Iterable[Schema]], Schema] {
    override def processElement(c: DoFn[KV[String, java.lang.Iterable[Schema]], Schema]#ProcessContext): Unit = {
      val idAndSchemas: KV[String, java.lang.Iterable[Schema]] = c.element()
      val it: java.util.Iterator[Schema] = idAndSchemas.getValue().iterator()
      c.output(it.next())
    }
  }

  override def apply(inputLines: PCollection[String]): PCollection[Schema] = {
    val schemasWithKey: PCollection[KV[String, Schema]] = inputLines.apply(
      ParDo.named("InferSchemas").of(new InferSchemaFromStringWithKeyFn())
    )

    val keyed: PCollection[KV[String, java.lang.Iterable[Schema]]] = schemasWithKey.apply(
      GroupByKey.create()
    )

    val schemasOnly: PCollection[Schema] = keyed.apply(
      ParDo.named("GetFirst").of(new GetFirstFn())
    )

    schemasOnly
  }
}

This problem doesn't reproduce in Java; Scala does something different with the types that breaks Dataflow's coder inference. To work around it, you can call setCoder on the PCollection to set its coder explicitly, e.g.

schemasWithKey.setCoder(KvCoder.of(StringUtf8Coder.of(), SchemaCoder.of()));

Here is a Java version of the code, just to make sure it is doing roughly the same thing:

public static class SchemaExtractorTransform
  extends PTransform<PCollection<String>, PCollection<Schema>> {
  class InferSchemaFromStringWithKeyFn extends DoFn<String, KV<String, Schema>> {
    public void processElement(ProcessContext c) {
      c.output(KV.of(c.element(), new Schema()));
    }
  }

  class GetFirstFn extends DoFn<KV<String, java.lang.Iterable<Schema>>, Schema> {
    private static final long serialVersionUID = 0;
    public void processElement(ProcessContext c) {
      c.output(c.element().getValue().iterator().next());
    }
  }

  public PCollection<Schema> apply(PCollection<String> inputLines) {
    PCollection<KV<String, Schema>> schemasWithKey = inputLines.apply(
        ParDo.named("InferSchemas").of(new InferSchemaFromStringWithKeyFn()));

    PCollection<KV<String, java.lang.Iterable<Schema>>> keyed =
        schemasWithKey.apply(GroupByKey.<String, Schema>create());

    PCollection<Schema> schemasOnly =
        keyed.apply(ParDo.named("GetFirst").of(new GetFirstFn()));

    return schemasOnly;
  }
}
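The GroupByKey + GetFirstFn pair above is effectively "one value per key". Outside the SDK, the same per-key deduplication can be sketched with plain Java collections (illustrative only; in a real pipeline GroupByKey does this in a distributed fashion, and which element is "first" per key is not deterministic):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAndTakeFirst {
    // Group the key/value pairs by key and keep one value per key,
    // mirroring what GroupByKey followed by GetFirstFn accomplishes.
    static Map<String, String> firstPerKey(List<Map.Entry<String, String>> kvs) {
        return kvs.stream().collect(Collectors.toMap(
            Map.Entry::getKey,
            Map.Entry::getValue,
            (first, second) -> first)); // on duplicate keys, keep the first value seen
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvs = List.of(
            Map.entry("schemaA", "v1"),
            Map.entry("schemaA", "v2"),
            Map.entry("schemaB", "v3"));
        System.out.println(firstPerKey(kvs)); // one entry each for schemaA and schemaB
    }
}
```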

Can you include the Scala code that doesn't work? As long as you use the same Pipeline object throughout, the snippet you posted should work.

@danielm I've updated the original question with more code and background. Thanks for taking a look!

Can you provide the code for ToKVTransform? Dataflow propagates type information through transforms in order to infer which coder to use, so knowing exactly what's happening in your case would be very helpful. Thanks!

Done! Please ignore the bad naming.

Ah, there's KvCoder.of(...)! Great, that should be the hint I need; I'll report back. In the meantime it has gotten me further, so the answer works for me. Cheers