Apache Beam GroupByKey() fails when running on Google Cloud Dataflow in Python
I have a pipeline built with the Apache Beam Python SDK 2.2.0. The pipeline is almost a canonical word count: I have pairs of names in the form ("John Doe, Jane Smith", 1), and I am trying to figure out how many times each pair of names appears together, like so:
p_collection
| "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupByKey" >> beam.GroupByKey()
| "AggregateGroups" >> beam.Map(lambda (pair, ones): (pair, sum(ones)))
| "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
When I run this code locally on a small dataset, it works perfectly fine.
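The logic is easy to sanity-check outside Beam. The following plain-Python sketch (a hypothetical stand-in, not Beam code) computes the same per-pair counts that the pipeline above produces:

```python
from collections import Counter

def pair_key(names):
    # Mirrors the "PairWithOne" lambda: join the names, drop non-ASCII characters.
    return ', '.join(names).encode("ascii", errors="ignore").decode()

def count_pairs(name_pairs):
    # GroupByKey over (key, 1) followed by sum(ones) is just a per-key count.
    counts = Counter(pair_key(p) for p in name_pairs)
    return [{'pair': k, 'pair_count': n} for k, n in counts.items()]

data = [("John Doe", "Jane Smith"),
        ("John Doe", "Jane Smith"),
        ("Ann Lee", "Bo Chen")]
print(count_pairs(data))
```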
But when I deploy it to Google Cloud Dataflow, I get the following error:
An exception was raised when trying to execute workitem
423109085466017585: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 69, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    self.output(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "dataflow_worker/shuffle_operations.py", line 233, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    self.output(wvalue.with_value((k, wvalue.value)))
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 415, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receiver.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 431, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise new_exn, None, original_traceback
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receiver.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 84, in apache_beam.runners.worker.operations.ConsumerSet.receive
    self.update_counters_start(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 90, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
    self.opcounter.update_from(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 63, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
    self.do_sample(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 81, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
    self.coder_impl.get_estimated_size_and_observables(windowed_value))
  File "apache_beam/coders/coder_impl.py", line 730, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
    def get_estimated_size_and_observables(self, value, nested=False):
  File "apache_beam/coders/coder_impl.py",
The fix is to replace the GroupByKey / AggregateGroups pair with a single CombinePerKey(sum) step, which sums the ones for each key instead of materializing them all:

p_collection
| "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupAndSum" >> beam.CombinePerKey(sum)
| "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
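CombinePerKey(sum) is more robust here because the runner can apply the combine function to each key's values before the shuffle (combiner lifting), so one running total per key crosses the network instead of a list of ones. A plain-Python sketch of the difference (the helper names are hypothetical, for illustration only):

```python
from collections import defaultdict

def group_then_sum(kv_pairs):
    # GroupByKey + Map(sum): every individual 1 is collected into a list first.
    groups = defaultdict(list)
    for k, v in kv_pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def combine_per_key(kv_pairs):
    # CombinePerKey(sum): keep a running total, no intermediate lists.
    totals = defaultdict(int)
    for k, v in kv_pairs:
        totals[k] += v
    return dict(totals)

kv = [("a", 1), ("b", 1), ("a", 1)]
print(combine_per_key(kv))
```

Both functions return the same result; the second never builds per-key lists, which is what makes the distributed version cheaper.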
If you actually need the grouped values themselves, you can keep GroupByKey but materialize the lazy result iterable with list() before handing it to downstream code:

... | beam.GroupByKey()
| beam.Map(lambda k_v: (k_v[0], foo(list(k_v[1]))))
| ...
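The list() call matters because the values produced by grouping arrive as a lazy iterable that can only be consumed once. A plain-Python illustration (foo here is a hypothetical two-pass function standing in for whatever your downstream code does):

```python
def foo(values):
    # Hypothetical downstream function that makes two passes over its input.
    return max(values) - min(values)

def grouped(key, value_iter):
    # list() materializes the iterable so foo can traverse it repeatedly.
    return key, foo(list(value_iter))

# Passing the raw iterator fails: max() drains it, then min() sees nothing.
try:
    foo(iter([3, 1, 2]))
except ValueError:
    pass

print(grouped("a", iter([3, 1, 2])))
```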
With that change, the final working pipeline looks like this:

pcollection_obj
| "MapWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupByKeyAndSum" >> beam.CombinePerKey(sum)
| "CreateDictionary" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})