Amazon S3: Can't get Apache Airflow to write to S3 using the EMR operators
I am using the Airflow EMR operators to create an AWS EMR cluster that runs a JAR file stored in S3 and then writes its output back to S3. The job runs fine using the JAR file from S3, but I cannot get it to write its output to S3. It does write the output to S3 when I run the same job as an AWS EMR CLI bash command, but I need to do this with the EMR operators. I set the S3 output directory both in the step configuration and in the JAR file's environment configuration, yet the operators still won't write there.

Here is the code for my Airflow DAG:
from datetime import datetime, timedelta
import airflow
from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator
DEFAULT_ARGS = {
    'owner': 'AIRFLOW_USER',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
RUN_STEPS = [
    {
        "Name": "run-custom-create-emr",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster", "--master", "yarn", "--conf",
                "spark.yarn.submit.waitAppCompletion=false", "--class", "CLASSPATH",
                "s3://INPUT_JAR_FILE",
                "s3://OUTPUT_DIR"
            ]
        }
    }
]
JOB_FLOW_OVERRIDES = {
    "Name": "JOB_NAME",
    "LogUri": "s3://LOG_DIR/",
    "ReleaseLabel": "emr-5.23.0",
    "Instances": {
        "Ec2KeyName": "KP_USER_NAME",
        "Ec2SubnetId": "SUBNET",
        "EmrManagedMasterSecurityGroup": "SG-ID",
        "EmrManagedSlaveSecurityGroup": "SG-ID",
        "InstanceGroups": [
            {
                "Name": "Master nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m4.large",
                "InstanceCount": 1
            },
            {
                "Name": "Slave nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "CORE",
                "InstanceType": "m4.large",
                "InstanceCount": 1
            }
        ],
        "TerminationProtected": True,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "Applications": [
        {"Name": "Spark"},
        {"Name": "Ganglia"},
        {"Name": "Hadoop"},
        {"Name": "Hive"}
    ],
    "JobFlowRole": "ROLE_NAME",
    "ServiceRole": "ROLE_NAME",
    "ScaleDownBehavior": "TERMINATE_AT_TASK_COMPLETION",
    "EbsRootVolumeSize": 10,
    "Tags": [
        {"Key": "Country", "Value": "us"},
        {"Key": "Environment", "Value": "dev"}
    ]
}
dag = DAG(
    'AWS-EMR-JOB',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    schedule_interval=None
)
cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_connection_CustomCreate',
    dag=dag
)
step_adder = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=RUN_STEPS,
    dag=dag
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)
cluster_creator.set_downstream(step_adder)
step_adder.set_downstream(step_checker)
step_checker.set_downstream(cluster_remover)
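When a step submitted this way appears to run but produces no output, EMR records the failure in the step's status, which `boto3`'s `describe_step` call returns. Below is a trimmed, hypothetical response (IDs and paths are placeholders) showing where the failure reason and log location live; the stderr under `LogFile` is typically where errors such as a Hadoop memory exception surface:

```python
# A trimmed, hypothetical describe_step response; the structure follows
# boto3's EMR client, but the IDs and paths are placeholders.
response = {
    "Step": {
        "Id": "s-XXXXXXXXXXXXX",
        "Status": {
            "State": "FAILED",
            "FailureDetails": {
                "Reason": "Unknown Error.",
                "LogFile": "s3://LOG_DIR/j-XXXXXXXXXXXXX/steps/s-XXXXXXXXXXXXX/",
            },
        },
    }
}

status = response["Step"]["Status"]
if status["State"] == "FAILED":
    # FailureDetails points at the S3 prefix holding the step's
    # controller/stderr logs for this job flow.
    print(status["FailureDetails"]["LogFile"])
```

In the DAG above, the `ClusterId` and `StepId` needed for a real `describe_step` call are exactly the values the `create_job_flow` and `add_steps` tasks push to XCom.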
Does anyone know how I can solve this problem? Maybe I need to do something with the S3Hook or the S3FileTransformOperator? Or is there something I'm missing with XCom? Any help would be greatly appreciated.

I believe I just solved my own problem. After digging deeper into all the local Airflow logs and the EMR logs in S3, I found a Hadoop memory exception, so I increased the core capacity the EMR run needed, and it now seems to work. I don't think the JAR code is the problem: the same JAR file works fine and writes its output to S3 when I create the EMR cluster with a CLI bash command.
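For reference, the fix described above amounts to giving the step more capacity than the single `m4.large` core node in `JOB_FLOW_OVERRIDES`. A minimal sketch of the kind of changes involved follows; the instance type, node count, and memory values here are illustrative assumptions, not the exact values that were used:

```python
# Hypothetical resized core group for JOB_FLOW_OVERRIDES["Instances"]
# ["InstanceGroups"]; values are illustrative, not the ones actually used.
RESIZED_CORE_GROUP = {
    "Name": "Slave nodes",
    "Market": "ON_DEMAND",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",  # larger instance than the original m4.large
    "InstanceCount": 2,           # more core nodes for the Spark executors
}

# Hypothetical extra spark-submit arguments for the RUN_STEPS Args list,
# pinning executor/driver memory instead of relying on cluster defaults.
SPARK_SUBMIT_MEMORY_ARGS = [
    "--executor-memory", "4g",
    "--driver-memory", "2g",
    "--executor-cores", "2",
]
```

Either lever (bigger/more core nodes, or explicit Spark memory settings) gives the job the headroom whose absence showed up as the Hadoop memory exception in the step logs.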