Apache Beam GroupByKey() fails when running on Google Cloud Dataflow in Python
I have a pipeline built with the Apache Beam Python SDK 2.2.0. The pipeline is almost a canonical word count: I have pairs of names in the form ("John Doe, Jane Smith", 1), and I am trying to figure out how many times each pair of names appears together, like so:
p_collection
| "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupByKey" >> beam.GroupByKey()
| "AggregateGroups" >> beam.Map(lambda (pair, ones): (pair, sum(ones)))
| "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
When I run this code locally on a small dataset, it works perfectly fine.
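The logic is easy to sanity-check outside Beam. The following plain-Python sketch (a hypothetical stand-in, not Beam code) computes the same per-pair counts that the pipeline above produces:

```python
from collections import Counter

def pair_key(names):
    # Mirrors the "PairWithOne" lambda: join the names, drop non-ASCII characters.
    return ', '.join(names).encode("ascii", errors="ignore").decode()

def count_pairs(name_pairs):
    # GroupByKey over (key, 1) followed by sum(ones) is just a per-key count.
    counts = Counter(pair_key(p) for p in name_pairs)
    return [{'pair': k, 'pair_count': n} for k, n in counts.items()]

data = [("John Doe", "Jane Smith"),
        ("John Doe", "Jane Smith"),
        ("Ann Lee", "Bo Chen")]
print(count_pairs(data))
```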
But when I deploy it to Google Cloud Dataflow, I get the following error:
An exception was raised when trying to execute workitem
423109085466017585: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 69, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    self.output(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "dataflow_worker/shuffle_operations.py", line 233, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    self.output(wvalue.with_value((k, wvalue.value)))
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 415, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receiver.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 431, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise new_exn, None, original_traceback
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receiver.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 84, in apache_beam.runners.worker.operations.ConsumerSet.receive
    self.update_counters_start(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 90, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
    self.opcounter.update_from(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 63, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
    self.do_sample(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 81, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
    self.coder_impl.get_estimated_size_and_observables(windowed_value))
  File "apache_beam/coders/coder_impl.py", line 730, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
    def get_estimated_size_and_observables(self, value, nested=False):
  File "apache_beam/coders/coder_impl.py",
The fix is to replace the GroupByKey / AggregateGroups pair with a single CombinePerKey(sum) step, which sums the ones for each key instead of materializing them all:

p_collection
| "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupAndSum" >> beam.CombinePerKey(sum)
| "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
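CombinePerKey(sum) is more robust here because the runner can apply the combine function to each key's values before the shuffle (combiner lifting), so one running total per key crosses the network instead of a list of ones. A plain-Python sketch of the difference (the helper names are hypothetical, for illustration only):

```python
from collections import defaultdict

def group_then_sum(kv_pairs):
    # GroupByKey + Map(sum): every individual 1 is collected into a list first.
    groups = defaultdict(list)
    for k, v in kv_pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def combine_per_key(kv_pairs):
    # CombinePerKey(sum): keep a running total, no intermediate lists.
    totals = defaultdict(int)
    for k, v in kv_pairs:
        totals[k] += v
    return dict(totals)

kv = [("a", 1), ("b", 1), ("a", 1)]
print(combine_per_key(kv))
```

Both functions return the same result; the second never builds per-key lists, which is what makes the distributed version cheaper.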
If you actually need the grouped values themselves, you can keep GroupByKey but materialize the lazy result iterable with list() before handing it to downstream code:

... | beam.GroupByKey()
| beam.Map(lambda k_v: (k_v[0], foo(list(k_v[1]))))
| ...
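The list() call matters because the values produced by grouping arrive as a lazy iterable that can only be consumed once. A plain-Python illustration (foo here is a hypothetical two-pass function standing in for whatever your downstream code does):

```python
def foo(values):
    # Hypothetical downstream function that makes two passes over its input.
    return max(values) - min(values)

def grouped(key, value_iter):
    # list() materializes the iterable so foo can traverse it repeatedly.
    return key, foo(list(value_iter))

# Passing the raw iterator fails: max() drains it, then min() sees nothing.
try:
    foo(iter([3, 1, 2]))
except ValueError:
    pass

print(grouped("a", iter([3, 1, 2])))
```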
With that change, the final working pipeline looks like this:

pcollection_obj
| "MapWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
| "GroupByKeyAndSum" >> beam.CombinePerKey(sum)
| "CreateDictionary" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})