Can TextIO in recent Google Cloud Dataflow versions (2.11 and later) read the lines of a file in parallel?

google-cloud-dataflow

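I read through the splittable DoFn blog post, and as far as I understand, this capability is already available in TextIO (for the Cloud Dataflow runner). What I'm not clear on is whether, using TextIO, I will be able to read the lines of a given file in parallel.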

For Java only, the TextIO source will automatically read an uncompressed file in parallel.

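This isn't officially documented, but the TextIO source is a subclass of FileBasedSource, which supports seeking. That means that if a worker decides to split the work, it can do so. See the splitting code in FileBasedSource.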
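To make the seeking point concrete, here is a minimal plain-Java sketch (not Beam code) of how a reader that owns a byte range of a file can pick out the lines that belong to it. The class `RangeLineReader`, the method `readRange`, and the rule "a line belongs to the range that contains its first byte" are assumptions made for illustration; Beam's own text source may differ in its details.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RangeLineReader {

  /** Returns the lines whose first byte lies in [start, end). Hypothetical helper, not a Beam API. */
  static List<String> readRange(Path file, long start, long end) throws IOException {
    List<String> lines = new ArrayList<>();
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      if (start > 0) {
        // Seek to the byte just before the range and discard the rest of that line.
        // If that byte is a newline, nothing is lost and we land exactly on `start`.
        raf.seek(start - 1);
        raf.readLine();
      }
      // Read every line that starts before `end`, even if it finishes after `end`.
      while (raf.getFilePointer() < end) {
        String raw = raf.readLine(); // maps each byte to one char (Latin-1 style)
        if (raw == null) {
          break; // end of file
        }
        // Re-decode the raw bytes as UTF-8.
        lines.add(new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("lines", ".txt");
    Files.write(file, "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8));
    // Three independent ranges over the 23-byte file; each could run on a different worker.
    System.out.println(readRange(file, 0, 8));   // [alpha, beta]
    System.out.println(readRange(file, 8, 16));  // [gamma]
    System.out.println(readRange(file, 16, 23)); // [delta]
    Files.delete(file);
  }
}
```

Every line is read exactly once even though the range boundaries fall in the middle of lines. This only works because the reader can seek to an arbitrary offset, which is not possible with a typical compressed stream.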

Cubez's answer is good. I'd also like to add that, as a PTransform and I/O connector, TextIO implements the expand() method:

@Override
public PCollection<String> expand(PCollection<FileIO.ReadableFile> input) {
  return input.apply(
      "Read all via FileBasedSource",
      new ReadAllViaFileBasedSource<>(
          getDesiredBundleSizeBytes(),
          new CreateTextSourceFn(getDelimiter()),
          StringUtf8Coder.of()));
}

If we look further, we can see that the ReadAllViaFileBasedSource class also has an expand() method, defined as follows:

@Override
public PCollection<T> expand(PCollection<ReadableFile> input) {
  return input
    .apply("Split into ranges", ParDo.of(new SplitIntoRangesFn(desiredBundleSizeBytes)))
    .apply("Reshuffle", Reshuffle.viaRandomKey())
    .apply("Read ranges", ParDo.of(new ReadFileRangesFn<>(createSource)))
    .setCoder(coder);
}

This is how the underlying runner distributes the PCollection across executors and reads it in parallel.
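As a rough picture of what the "Split into ranges" step produces, the following plain-Java sketch cuts a file of a known size into byte ranges of at most a desired bundle size. It is a simplification under stated assumptions, not Beam's code: the class `SplitIntoRangesSketch`, the method `splitIntoRanges`, and the `splittable` flag (standing in for "the file is uncompressed") are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitIntoRangesSketch {

  /**
   * Splits [0, fileSizeBytes) into consecutive {from, to} ranges of at most
   * desiredBundleSizeBytes each. Hypothetical helper, not a Beam API.
   */
  static List<long[]> splitIntoRanges(
      long fileSizeBytes, long desiredBundleSizeBytes, boolean splittable) {
    if (desiredBundleSizeBytes <= 0) {
      throw new IllegalArgumentException("desiredBundleSizeBytes must be positive");
    }
    List<long[]> ranges = new ArrayList<>();
    if (!splittable) {
      // Assumption: a file that cannot be seeked into is read as one whole range.
      ranges.add(new long[] {0, fileSizeBytes});
      return ranges;
    }
    for (long from = 0; from < fileSizeBytes; from += desiredBundleSizeBytes) {
      long to = Math.min(from + desiredBundleSizeBytes, fileSizeBytes);
      ranges.add(new long[] {from, to});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A 23-byte file with a desired bundle size of 8 bytes.
    for (long[] range : splitIntoRanges(23, 8, true)) {
      System.out.println("[" + range[0] + ", " + range[1] + ")");
    }
    // Prints:
    // [0, 8)
    // [8, 16)
    // [16, 23)
  }
}
```

In the pipeline above, each such range becomes its own element, the Reshuffle step redistributes those elements across workers, and the "Read ranges" step reads each range independently.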