Apache Beam GroupByKey() fails when running on Google Cloud Dataflow in Python

Tags: python, google-cloud-dataflow, apache-beam

I have a pipeline built with the Apache Beam Python SDK 2.2.0.

The pipeline is almost a textbook word count: I have pairs of names in the format ("John Doe, Jane Smith", 1), and I am trying to count how many times each pair of names occurs together, like so:

p_collection
            | "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
            | "GroupByKey" >> beam.GroupByKey()
            | "AggregateGroups" >> beam.Map(lambda (pair, ones): (pair, sum(ones)))
            | "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
When I run the pipeline locally on a small dataset, it works perfectly.

But when I deploy it to Google Cloud Dataflow, I get the following error:

An exception was raised while attempting to execute the workitem 423109085466017585:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 69, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    self.output(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "dataflow_worker/shuffle_operations.py", line 233, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    self.output(wvalue.with_value((k, wvalue.value)))
  [... the same Operation.output / ConsumerSet.receive / DoOperation.process / DoFnRunner.process frames repeat as the element passes through the downstream transforms ...]
  File "apache_beam/runners/common.py", line 431, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise new_exn, None, original_traceback
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receivers.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 84, in apache_beam.runners.worker.operations.ConsumerSet.receive
    self.update_counters_start(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 90, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
    self.opcounter.update_from(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 63, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
    self.do_sample(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 81, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
    self.coder_impl.get_estimated_size_and_observables(windowed_value))
  File "apache_beam/coders/coder_impl.py", line 730, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
    def get_estimated_size_and_observables(self, value, nested=False):
  File "apache_beam/coders/coder_impl.py", [the trace is cut off here]
The error went away after replacing the explicit GroupByKey + Map(sum) steps with CombinePerKey(sum):

p_collection
        | "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
        | "GroupAndSum" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
If you do still need GroupByKey, keep in mind that the grouped values come back as a lazy iterable rather than a list, so materialize them before handing them to ordinary Python code:

... | beam.GroupByKey()
    | beam.Map(lambda k_v: (k_v[0], foo(list(k_v[1]))))
    | ...
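
A minimal sketch of that pattern, assuming a hypothetical post-processing function foo() that needs more than one pass over the values:

    import apache_beam as beam

    def foo(values):
        # Hypothetical helper that iterates the values more than once,
        # which is why they are materialized with list() below.
        return {"count": len(values), "total": sum(values)}

    with beam.Pipeline() as p:
        (p
         | beam.Create([("a", 1), ("a", 2), ("b", 3)])
         | beam.GroupByKey()  # values arrive as a lazy iterable
         | beam.Map(lambda k_v: (k_v[0], foo(list(k_v[1]))))
         | beam.Map(print))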
Applied to the original pipeline, the working version looks like this:

pcollection_obj
        | "MapWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
        | "GroupByKeyAndSum" >> beam.CombinePerKey(sum)
        | "CreateDictionary" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})