Apache Beam explanation of Python ParDo behavior

python google-cloud-dataflow

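Given a text file in ndjson format, the code below produces the result I expect: an ndjson file in which the quotes.USD dict is unnested and the original quotes element is removed.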

  def unnest_quotes(element):
      # Promote quotes['USD'] to a top-level key and drop the nested 'quotes' dict.
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      return element

  p = beam.Pipeline(options=pipeline_options)
  ReadJson = p | ReadFromText(known_args.input, coder=JsonCoder())
  MapFormattedJson = ReadJson | 'Map Function' >> beam.Map(unnest_quotes)
  MapFormattedJson | 'Write Map Output' >> WriteToText(known_args.output, coder=JsonCoder())

However, when I try to achieve the same thing with ParDo, I don't understand its behavior.

  class UnnestQuotes(beam.DoFn):
    def process(self,element):
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      return element

  p = beam.Pipeline(options=pipeline_options)
  ReadJson = p | ReadFromText(known_args.input, coder=JsonCoder())
  ClassFormattedJson = ReadJson | 'Pardo' >> beam.ParDo(UnnestQuotes())
  ClassFormattedJson | 'Write Class Output' >> WriteToText(known_args.output, coder=JsonCoder())

This produces a file in which each key of the dict is on its own line, without its value, as shown below:

"last_updated"
"name"
"symbol"
"rank"
"total_supply"
"max_supply"
"circulating_supply"
"website_slug"
"id"
"USDquotes"

It is as if the PCollection produced by the Map function contains the complete dict, whereas the ParDo produces a PCollection for each key.

I know I could just use the Map function, but I need to understand this behavior for when I really do need to use ParDo in the future.

I solved this with the help of this answer.

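The behavior I was seeing is the same as the difference between FlatMap and Map. To get the desired behavior, all I needed to do was wrap the data returned from the ParDo in a list: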

  class UnnestQuotes(beam.DoFn):
    def process(self,element):
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      # Wrap the dict in a list so that ParDo emits it as a single element.
      return [element]
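
To illustrate why the list is needed, here is a minimal plain-Python sketch of the difference (it does not use Beam at all). The helpers simulate_map and simulate_pardo are hypothetical names made up for this illustration, not Beam APIs, and the sample record is invented. The sketch assumes only what was observed above: ParDo iterates over whatever process returns and emits each item, and iterating a dict yields its keys. Using yield in process should be an equivalent alternative to returning a one-item list.

```python
# Plain-Python sketch (no Beam required) of how Map and ParDo/FlatMap
# treat a function's return value. The helper names are hypothetical.

def simulate_map(fn, elements):
    """Map-like: each return value becomes exactly one output element."""
    return [fn(e) for e in elements]


def simulate_pardo(process, elements):
    """ParDo/FlatMap-like: each return value is iterated, and every item
    it produces becomes a separate output element."""
    out = []
    for e in elements:
        result = process(e)
        if result is not None:  # returning None emits nothing
            out.extend(result)
    return out


def unnest(element):
    element = dict(element)  # work on a copy so the input is not mutated
    element['USDquotes'] = element.pop('quotes')['USD']
    return element


def unnest_in_list(element):
    return [unnest(element)]  # one-item list: the dict is emitted whole


def unnest_with_yield(element):
    yield unnest(element)  # generator form: also emits the dict whole


# Invented sample record, loosely shaped like the data in the question.
record = {'name': 'Bitcoin', 'quotes': {'USD': {'price': 1.0}}}
expected = {'name': 'Bitcoin', 'USDquotes': {'price': 1.0}}

map_out = simulate_map(unnest, [record])                 # the whole dict
bare_out = simulate_pardo(unnest, [record])              # only the dict's keys
list_out = simulate_pardo(unnest_in_list, [record])      # the whole dict
yield_out = simulate_pardo(unnest_with_yield, [record])  # the whole dict

print(map_out)
print(bare_out)
```

The bare return reproduces the keys-only output from the question, while the list and yield variants give the same result as Map.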

Would you mind accepting your own answer, for future reference by the community? I believe you have to wait 2 days after posting it.