Python 2.7 Airflow exception: DataFlow failed with return code 2
I am trying to execute a Dataflow Python file that reads a text file from a GCS bucket, using DataFlowPythonOperator in an Airflow DAG. I can run the Python file on its own, but it fails when I run it through Airflow. I am using a service account to authenticate the default GCP connection.
The error I get when executing the job is:
{gcp_dataflow_hook.py:108} INFO - Start waiting for DataFlow process to complete.
{models.py:1417} ERROR - DataFlow failed with return code 2
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 182, in execute
    self.py_file, self.py_options)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_python_dataflow
    task_id, variables, dataflow, name, ["python"] + py_options)
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 138, in _start_dataflow
    _Dataflow(cmd).wait_for_done()
  File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 119, in wait_for_done
    self._proc.returncode))
Exception: DataFlow failed with return code 2
My script:
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from datetime import datetime, timedelta

# Default DAG parameters
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': <email>,
    'email_on_failure': False,
    'email_on_retry': False,
    'start_date': datetime(2018, 4, 30),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'dataflow_default_options': {
        'project': '<Project ID>'
    }
}

dag = DAG(
    dag_id='df_dag_readfromgcs',
    default_args=default_args,
    schedule_interval=timedelta(minutes=60)
)

task1 = DataFlowPythonOperator(
    task_id='task1',
    py_file='~<path>/1readfromgcs.py',
    gcp_conn_id='default_google_cloud_connection',
    dag=dag
)
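Thank you for your thoughts and help with my problem.

This exception comes from _proc, which is a subprocess; it returns an exit code from the shell. I haven't used this component myself. Depending on what was executed, exit code 2 indicates why the process exited. For example, in bash, exit code 2 means:
Misuse of shell builtins
and can be related to:
Missing keyword or command, or permission problem
So it may be connected to the underlying Dataflow configuration. Try executing the file manually while impersonating that user.

It is most likely an authentication issue, as you said, because I am using a service account that may not have the required permissions. I will try changing the account and retry. Thanks for the help; I will report back once I have been able to make the change.

I'm not very familiar with Airflow, but I don't see anything odd that would cause the Dataflow job to fail. In any case, if the pipeline works when you run it directly in Dataflow (not through Airflow), the error does appear to be on the Airflow side. If the application ends in the method wait_for_done(), my understanding is that the job ran in the GCP project but failed during execution, so you should be able to find more details in the Dataflow UI and logs. Please check whether there is any relevant information you can add.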
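As a small illustration of how a return code of 2 can come out of a child Python process, here is a minimal sketch. It is not Airflow code: run_child, the missing script path and the argparse snippet are hypothetical names made up for the demo. It relies on two facts: the Python interpreter exits with status 2 when it cannot open the script file it was given, and argparse exits with status 2 when it rejects the command line. So a return code of 2 does not necessarily point to a permission problem.

```python
import os
import subprocess
import sys
import tempfile


def run_child(args):
    """Run a child Python interpreter and return (exit code, stderr text)."""
    proc = subprocess.Popen(
        [sys.executable] + args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _, err = proc.communicate()
    return proc.returncode, err.decode("utf-8", "replace")


# Case 1: the script path does not exist (hypothetical path for the demo).
missing_script = os.path.join(
    tempfile.gettempdir(), "no_such_dir_for_demo", "pipeline.py")
missing_code, missing_err = run_child([missing_script])

# Case 2: argparse rejects an argument it does not know.
argparse_snippet = (
    "import argparse\n"
    "p = argparse.ArgumentParser()\n"
    "p.add_argument('--input')\n"
    "p.parse_args(['--unknown-flag'])\n"
)
argparse_code, argparse_err = run_child(["-c", argparse_snippet])

print("missing script -> exit code %d" % missing_code)
print("argparse error -> exit code %d" % argparse_code)
```

In both cases the text captured from stderr names the actual cause. A related point to check, stated as an assumption because the path in the question is elided: a leading ~ is expanded by a shell, not by a subprocess started from an argument list, so a py_file beginning with ~ may need os.path.expanduser or an absolute path.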
from __future__ import absolute_import

import argparse
import logging

import apache_beam as beam
import apache_beam.pipeline as pipeline
import apache_beam.io as beamio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText


def runCode(argv=None):
    # Parse --input; all other arguments are passed through to Beam.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        default='<Input file path>',
                        help='File name')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Options for running the pipeline on the Dataflow service.
    pipeline_args.extend([
        '--project=<project name>',
        '--runner=DataflowRunner',
        '--job_name=<job name>',
        '--region=europe-west1',
        '--staging_location=<GCS staging location>',
        '--temp_location=<GCS temp location>'
    ])
    pipeline_options = PipelineOptions(pipeline_args)

    # Build the pipeline: a single step that reads the input text file.
    p = beam.pipeline.Pipeline(options=pipeline_options)
    rows = p | 'read' >> beam.io.ReadFromText(known_args.input)

    # Submit the job and block until it finishes.
    p.run().wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    runCode()
def wait_for_done(self):
    reads = [self._proc.stderr.fileno(), self._proc.stdout.fileno()]
    self.log.info("Start waiting for DataFlow process to complete.")
    while self._proc.poll() is None:
        ret = select.select(reads, [], [], 5)
        if ret is not None:
            for fd in ret[0]:
                line = self._line(fd)
                self.log.debug(line[:-1])
        else:
            self.log.info("Waiting for DataFlow process to complete.")
    if self._proc.returncode is not 0:
        raise Exception("DataFlow failed with return code {}".format(
            self._proc.returncode))
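Note that wait_for_done() above writes the child process output with self.log.debug, so when the task log only shows INFO and ERROR lines (as in the log in the question), the message that explains the failure is not visible. One way to see it is to run the same command by hand and print its stderr. The following is a minimal sketch under that assumption; run_and_report and demo_cmd are hypothetical names, and the demo command is only a stand-in that writes to stderr and exits with 2.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd without a shell and return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(
        cmd,
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    return (proc.returncode,
            out.decode("utf-8", "replace"),
            err.decode("utf-8", "replace"))


# Stand-in for the real command: a child that fails with exit code 2.
demo_cmd = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('boom\\n'); sys.exit(2)",
]

code, out, err = run_and_report(demo_cmd)
if code != 0:
    # The stderr text is what the hook only logs at DEBUG level.
    print("exit code %d, stderr:\n%s" % (code, err))
```

To apply this to the failing task, replace demo_cmd with the command the hook builds (python followed by the py_file path and its options), ideally run as the same OS user that runs the Airflow worker.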