Python Apache Beam: how do I save a variable from a DoFn for use later in the pipeline?

python google-cloud-dataflow

I have built a Beam/Dataflow pipeline to process shapefiles. I have a simple pipeline:

with beam.Pipeline(options=pipeline_options) as p:
    feature_collection = (p
     | beam.Create([known_args.gcs_url])
     | 'LoadShapefile' >> beam.ParDo(LoadShapefile())
     | beam.Map(print))

class LoadShapefile(beam.DoFn):
    def process(self, gcs_url):
        with beam.io.gcp.gcsio.GcsIO().open(gcs_url, 'rb') as f:
            collection = BytesCollection(f.read())
            return iter(collection)

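This pipeline works fine, but I need to capture an additional attribute of collection that is not available on the individual elements. I need collection.crs to be available later in the pipeline, as a variable or argument in a DoFn or beam.Map, so that each element can be processed correctly.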

I would like to return something like this:

return (collection.crs, iter(collection))

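But I don't know how to separate the collection iterator from the .crs attribute and still have the pipeline work. In a non-Beam world I would probably set a global variable CRS that is available everywhere, but AFAIK that is not possible in Beam.

What is the right way to achieve this in Beam?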

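Edit: collection.crs is a small dict that looks like this: {'init': 'epsg:2284'}. This dict will never contain more than a couple of items, but this metadata is critical for correctly processing the elements of the collection.
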
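You could use a tagged output for your small dictionary and then use it as a side input in the next step, but you would have to implement branching logic.
Couldn't you use this information to refine the data right away, before passing it along the pipeline?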
You can output a set of tuples that pair each element with the collection's dictionary, like this:
class LoadShapefile(beam.DoFn):
    def process(self, gcs_url):
        with beam.io.gcp.gcsio.GcsIO().open(gcs_url, 'rb') as f:
            collection = BytesCollection(f.read())
            return [(elm, collection.crs) for elm in collection]

You can also make it a side input:
class LoadShapefile(beam.DoFn):
    def process(self, gcs_url):
        with beam.io.gcp.gcsio.GcsIO().open(gcs_url, 'rb') as f:
            collection = BytesCollection(f.read())
            for elm in collection:
                yield elm
            yield beam.pvalue.TaggedOutput('crsdata', collection.crs)

Then you would do this:
    with beam.Pipeline(options=pipeline_options) as p:
        feature_collections = (p
         | beam.Create([known_args.gcs_url])
         | 'LoadShapefile' >> beam.ParDo(LoadShapefile()).with_outputs('crsdata', main='main'))
        collection_crs = beam.pvalue.AsSingleton(feature_collections['crsdata'])
        feature_collection = feature_collections['main']
        # Use these PCollections as you see fit.

Note that this only works for a single gcs_url input. If you have more than one, your side input should be a map or a list rather than a singleton.
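I could use it there, but I believe that would reduce the pipeline's parallelism. I tried setting up a tagged output but could never get it to work. If that is the right approach, I'll try again to get it working.
I'm not sure whether this is correct, but I would try it :) Maybe this and the links in it will help?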