Dataflow: Top module with the Python SDK: single-element PCollection

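I am looking at the word_counting.py example in the incubator-beam repository (linked from the Dataflow documentation), and I want to modify it to get the n words with the most occurrences. Here is my pipeline: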

  counts = (lines
        | 'split' >> (beam.ParDo(WordExtractingDoFn())
                      .with_output_types(unicode))
        | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
        | 'group' >> beam.GroupByKey()
        | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones)))
        | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c))  # 'top' is the only added line

  output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
  output | 'write' >> beam.io.Write(beam.io.TextFileSink(known_args.output))
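
I added one line using the Top.Of() method, but it seems to return a PCollection with a single element, which is an array. I was expecting an ordered PCollection, but from the documentation it looks like PCollections are unordered collections.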

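When the pipeline runs, beam.Map iterates over only one element (the whole array), and in 'format' the lambda function raises an error because it cannot unpack the whole array into the tuple (word, c).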


How should I handle this single-element PCollection at this step without breaking the pipeline?

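If you want to expand a PCollection of iterables into a PCollection of the elements of those iterables, you can use FlatMap, whose argument is a function from an element to an iterable of results. In our case the elements themselves are iterables, so we use the identity function: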

  counts = ...
        | 'top' >> beam.combiners.Top.Of('top', 10, key=lambda (word, c): c)
        | 'expand' >> beam.FlatMap(lambda word_counts: word_counts) # sic!

  output = counts | 'format' >> beam.Map(lambda (word, c): '%s: %s' % (word, c))
  ...
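
To make the shape of the data at each step easier to see, here is a minimal plain-Python sketch that does not use Beam at all. The helpers `top_of`, `flat_map` and `map_elements` are hypothetical stand-ins written for this illustration (they are not SDK functions), and a list is used to model a PCollection, which is an assumption since a real PCollection is unordered and distributed. The sample counts are made up.

```python
import heapq
from itertools import chain


def top_of(pcoll, n, key):
    # Stand-in for Top.Of as described above: the result holds exactly ONE
    # element, and that element is the list of the n largest inputs.
    return [heapq.nlargest(n, pcoll, key=key)]


def flat_map(fn, pcoll):
    # Stand-in for FlatMap: fn returns an iterable per element,
    # and all of those iterables are concatenated into one collection.
    return list(chain.from_iterable(fn(element) for element in pcoll))


def map_elements(fn, pcoll):
    # Stand-in for Map: exactly one output per input element.
    return [fn(element) for element in pcoll]


# Made-up (word, count) pairs, as produced by the 'count' step.
counts = [("the", 5), ("cat", 2), ("sat", 1), ("on", 3)]

# One element: a list of pairs. Mapping 'format' over this directly would
# try to unpack the whole list as a single (word, c) pair.
top = top_of(counts, 2, key=lambda wc: wc[1])

# The identity function turns that single list back into individual pairs.
expanded = flat_map(lambda word_counts: word_counts, top)

formatted = map_elements(lambda wc: "%s: %s" % (wc[0], wc[1]), expanded)
print(formatted)
```

The point of the sketch is only the difference between `top` (one list-valued element) and `expanded` (one element per pair); after the expansion, the 'format' step sees individual (word, c) tuples again.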