Dataflow: single-element PCollection with the Python SDK

I'm looking at the word_counting.py example in the incubator-beam repository (linked from the Dataflow documentation), and I want to modify it to get the top n most frequent words. Here is my pipeline:
counts = (lines
          | 'split' >> (beam.ParDo(WordExtractingDoFn())
                        .with_output_types(unicode))
          | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
          | 'group' >> beam.GroupByKey()
          | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones)))
          | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c))  # 'top' is the only added line
output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
output | 'write' >> beam.io.Write(beam.io.TextFileSink(known_args.output))
I added one line using the Top.Of() method, but it seems to return a PCollection whose single element is the whole array of top pairs. (I was expecting an ordered PCollection, but from the documentation a PCollection appears to be an unordered collection.) When the pipeline runs, beam.Map iterates over just that one element (the entire array), and in 'format' the lambda raises an error because it cannot unpack the whole array into the tuple (word, c). How should I handle this single-element PCollection at this step without breaking the pipeline?

To expand a PCollection of iterables into a PCollection of the elements of those iterables, you can use FlatMap, whose argument is a function from an element to the resulting iterable. In our case each element is itself iterable, so we use the identity function:
counts = ...
         | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c)
         | 'expand' >> beam.FlatMap(lambda word_counts: word_counts)  # sic!
output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
...
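To see why the identity function is enough here, a minimal plain-Python sketch of the FlatMap semantics may help (the `flat_map` helper and the sample data below are hypothetical stand-ins, not Beam APIs): after Top.Of, the collection holds a single element, the whole list of top pairs, so a per-element format step cannot unpack it; flattening first restores one pair per element.

```python
# Hypothetical stand-in for the PCollection produced by Top.Of:
# one element, which is itself the list of (word, count) pairs.
top_result = [[("the", 42), ("and", 17), ("of", 9)]]

# A per-element 'format' step (like beam.Map) tries to unpack each
# element as (word, c) -- but the only element is the whole list.
try:
    formatted = ["%s: %s" % (word, c) for word, c in top_result]
except ValueError as err:
    unpack_error = str(err)  # "too many values to unpack ..."

def flat_map(fn, collection):
    # Simplified model of beam.FlatMap: fn maps each element to an
    # iterable, and all resulting iterables are concatenated.
    return [out for element in collection for out in fn(element)]

# With the identity function, the nested list is expanded into one
# (word, count) pair per element, and formatting now works.
expanded = flat_map(lambda x: x, top_result)
formatted = ["%s: %s" % (word, c) for word, c in expanded]
```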