Filtering GCS URIs before job execution (Google Cloud Storage, Google Cloud Dataflow, Apache Beam)

google-cloud-storage google-cloud-dataflow

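I regularly run into a use case that I cannot solve. Suppose I have a file pattern like gs://mybucket/mydata/*/files.json, where * is supposed to match a date.
Suppose I want to keep 251 dates (just an example: assume there are many dates, but no meta-pattern such as 2019* that would match them).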
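At the moment I have two options:

Create one TextIO per file, which is overkill and fails almost every time (the graph gets too large).
Read all the data and then filter it inside my job, which is also overkill when you have 10 TB of data and only need 10 GB.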

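In my case, I would just like to do something like this (pseudocode):
That instruction would actually produce a single TextIO task in the graph.
Sorry if I have missed something, but I could not find a way to do it.
Thanks.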
OK, I found it. The name was misleading me:
Example 2: reading a PCollection of filenames.

 Pipeline p = ...;

 // E.g. the filenames might be computed from other data in the pipeline, or
 // read from a data source.
 PCollection<String> filenames = ...;

 // Read all files in the collection.
 PCollection<String> lines =
     filenames
         .apply(FileIO.matchAll())
         .apply(FileIO.readMatches())
         .apply(TextIO.readFiles());

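(Quoted from the Apache Beam documentation.)
So we need to build a PCollection of URIs (using Create.of) or read it from within the pipeline, then match all the URIs (or patterns, I guess) and read all the files.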
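
As a rough illustration of the "build the list of URIs first" step, here is a minimal sketch that uses only the Java standard library. The class name GcsUriFilter, the helper urisForDates, and the yyyy-MM-dd layout of the date directory are assumptions made for the example; the question does not say how dates appear in the path. The Beam wiring is left as a comment because it needs the Beam SDK on the classpath and has not been run here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class GcsUriFilter {

    // Builds one concrete URI per wanted date, so that only those files
    // are matched and read later on.
    // Assumption: the date directory is formatted as yyyy-MM-dd.
    static List<String> urisForDates(String prefix, List<LocalDate> dates) {
        List<String> uris = new ArrayList<>();
        for (LocalDate date : dates) {
            String day = date.format(DateTimeFormatter.ISO_LOCAL_DATE);
            uris.add(prefix + "/" + day + "/files.json");
        }
        return uris;
    }

    public static void main(String[] args) {
        // Hypothetical selection of dates; in practice this could be the 251 dates to keep.
        List<LocalDate> wanted = new ArrayList<>();
        wanted.add(LocalDate.of(2019, 1, 5));
        wanted.add(LocalDate.of(2019, 3, 17));

        List<String> uris = urisForDates("gs://mybucket/mydata", wanted);
        uris.forEach(System.out::println);

        // With the Beam SDK available, the list could then feed the pattern
        // quoted above (untested sketch):
        //   p.apply(Create.of(uris))
        //    .apply(FileIO.matchAll())
        //    .apply(FileIO.readMatches())
        //    .apply(TextIO.readFiles());
    }
}
```

The point of this shape is that the graph contains a single read step regardless of how many dates are selected, because the URIs are data flowing through the pipeline rather than separate transforms.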