
Python: Running an Apache Beam pipeline on Dataflow throws an error (no problems when running with DirectRunner)

Tags: python, google-cloud-dataflow, apache-beam

A pipeline that normally runs fine throws an error when run on Dataflow, so I tried a simple pipeline and got the same error.

The same pipeline runs without problems on the DirectRunner. The execution environment is a Google Datalab notebook.

Please let me know if anything in my environment needs to be changed or updated, or if you have any other suggestions.

Many thanks,
e
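For context, a trivial read/write pipeline of this shape is enough to hit the error when submitted with the DataflowRunner from Datalab (a sketch, not the original notebook cell; the project and bucket names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions

# Placeholder project and bucket names; any simple read -> write pipeline behaves the same way.
options = PipelineOptions()
gcp_opts = options.view_as(GoogleCloudOptions)
gcp_opts.project = 'YOUR_PROJECT_ID'
gcp_opts.staging_location = 'gs://YOUR_BUCKET/staging'
gcp_opts.temp_location = 'gs://YOUR_BUCKET/tmp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p1 = beam.Pipeline(options=options)
(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://YOUR_BUCKET/output', num_shards=1))
p1.run().wait_until_finish()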

Submitting the pipeline to Dataflow triggers the following error:

CalledProcessErrorTraceback (most recent call last)
<ipython-input-17-b4be63f7802f> in <module>()
      5  )
      6 
----> 7 p1.run().wait_until_finish()

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    174       finally:
    175         shutil.rmtree(tmpdir)
--> 176     return self.runner.run(self)
    177 
    178   def __enter__(self):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in run(self, pipeline)
    250     # Create the job
    251     result = DataflowPipelineResult(
--> 252         self.dataflow_client.create_job(self.job), self)
    253 
    254     self._metrics = DataflowMetrics(self.dataflow_client, result, self.job)

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/retry.pyc in wrapper(*args, **kwargs)
    166       while True:
    167         try:
--> 168           return fun(*args, **kwargs)
    169         except Exception as exn:  # pylint: disable=broad-except
    170           if not retry_filter(exn):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job(self, job)
    423   def create_job(self, job):
    424     """Creates job description. May stage and/or submit for remote execution."""
--> 425     self.create_job_description(job)
    426 
    427     # Stage and submit the job when necessary

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job_description(self, job)
    446     """Creates a job described by the workflow proto."""
    447     resources = dependency.stage_job_resources(
--> 448         job.options, file_copy=self._gcs_file_copy)
    449     job.proto.environment = Environment(
    450         packages=resources, options=job.options,

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in stage_job_resources(options, file_copy, build_setup_args, temp_dir, populate_requirements_cache)
    377       else:
    378         sdk_remote_location = setup_options.sdk_location
--> 379       _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    380       resources.append(names.DATAFLOW_SDK_TARBALL_FILE)
    381     else:

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    462   elif sdk_remote_location == 'pypi':
    463     logging.info('Staging the SDK tarball from PyPI to %s', staged_path)
--> 464     _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
    465   else:
    466     raise RuntimeError(

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _download_pypi_sdk_package(temp_dir)
    525       '--no-binary', ':all:', '--no-deps']
    526   logging.info('Executing command: %s', cmd_args)
--> 527   processes.check_call(cmd_args)
    528   zip_expected = os.path.join(
    529       temp_dir, '%s-%s.zip' % (package_name, version))

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/processes.pyc in check_call(*args, **kwargs)
     42   if force_shell:
     43     kwargs['shell'] = True
---> 44   return subprocess.check_call(*args, **kwargs)
     45 
     46 

/usr/local/envs/py2env/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    188         if cmd is None:
    189             cmd = popenargs[0]
--> 190         raise CalledProcessError(retcode, cmd)
    191     return 0
    192 

CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/tmpyyiizo', 'google-cloud-dataflow==2.0.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 2
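The last few frames show what actually failed: before submitting the job, the Dataflow runner stages the SDK by shelling out to pip install --download ... google-cloud-dataflow==2.0.0, and that subprocess exited with status 2, the code pip's option parser uses for an unrecognized option. Newer pip releases dropped the --download flag from pip install (it became the separate pip download command), which would explain why an SDK as old as 2.0.0 fails to stage itself here. One way to confirm is to rerun the command from the traceback by hand and look at pip's own message (a debugging sketch; the interpreter path and package version are copied from the traceback above, and the temp directory is arbitrary):

import subprocess

# Re-run the staging command from the traceback and capture pip's own error message.
cmd = ['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install',
       '--download', '/tmp/sdk_staging', 'google-cloud-dataflow==2.0.0',
       '--no-binary', ':all:', '--no-deps']
try:
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as err:
    output = err.output  # e.g. "no such option: --download" on pip releases that removed the flag
print(output)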

I was able to run your job with the DataflowRunner from a Jupyter notebook (rather than Datalab itself) without any problems.

At the time of writing, I am using the latest version (v2.6.0) of the apache_beam[gcp] Python SDK. Could you retry with v2.6.0 instead of v2.0.0?
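If the Datalab kernel still has the 2.0.0 SDK installed, upgrading it in place and restarting the kernel should be enough to pick up the newer staging logic. For example, from a notebook cell (a sketch; pin whichever current release you prefer):

import sys
import subprocess

# Upgrade the Beam Python SDK with the GCP extras in the environment the notebook runs in;
# restart the kernel afterwards so the new version is imported.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade',
                       'apache-beam[gcp]==2.6.0'])

The same thing can be done with !pip install --upgrade apache-beam[gcp]==2.6.0 directly in a cell.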

Here is what I ran:

import apache_beam as beam
from apache_beam.pipeline import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

BUCKET_URL = "gs://YOUR_BUCKET_HERE/test"

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH_TO_YOUR_SERVICE_ACCOUNT_JSON_CREDS'

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'YOUR_PROJECT_ID_HERE'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL  # 'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL  # 'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p1 = beam.Pipeline(options=options)
(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
)
p1.run().wait_until_finish()
And here is proof of it running (screenshot of the Dataflow job):

As expected, the job failed because I do not have write access to 'gs://bucket/test.txt' - you can also see this in the stacktrace in the bottom-left corner of the screenshot. The job was nevertheless submitted to Google Cloud Dataflow successfully and did run.
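To have the job succeed end to end rather than just submit, point the write step at a bucket the credentials can actually write to, for example (a sketch reusing the options and the BUCKET_URL placeholder from the snippet above):

# Same pipeline, but writing into the bucket the service account can access.
p2 = beam.Pipeline(options=options)
(p2 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('%s/output/kinglear' % BUCKET_URL, num_shards=1)
)
p2.run().wait_until_finish()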
