Amazon S3: Moving data from Google Cloud Storage to S3 using a Dataproc Hadoop cluster and Airflow
I am trying to transfer a large amount of data from GCS to an S3 bucket. I have built a Hadoop cluster using Google Dataproc. I can run the job through the Hadoop CLI with the following command:
hadoop distcp -update gs://GCS-bucket/folder s3a://[my_aws_access_id]:[my_aws_secret]@aws-bucket/folder
I am new to MapReduce and Hadoop. I am trying to add this to my Airflow workflow using the DataProcHadoopOperator:
export_to_s3 = DataProcHadoopOperator(
    task_id='export_to_s3',
    main_jar=None,
    main_class=None,
    arguments=None,
    archives=None,
    files=None,
    job_name='{{task.task_id}}_{{ds_nodash}}',
    cluster_name='optimize-m',
    dataproc_hadoop_properties=None,
    dataproc_hadoop_jars=None,
    gcp_conn_id='google_cloud_default',
    delegate_to=None,
    region='global',
    dag=dag
)
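My Airflow instance runs in Docker on a Compute Engine instance.
I am not sure how to configure the operator so that it submits the following as a job: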
hadoop distcp -update gs://GCS-bucket/folder s3a://[my_aws_access_id]:[my_aws_secret]@aws-bucket/folder
I followed the suggestion and ended up with the following task:
export_to_s3 = DataProcHadoopOperator(
    task_id='export_to_s3',
    main_jar='file:///usr/lib/hadoop-mapreduce/hadoop-distcp.jar',
    main_class=None,
    arguments='-update gs://umg-comm-tech-dev/data/apollo/QA/ s3a://[mys3accessid]:[mys3secret]@s3://umg-ers-analytics/qubole/user-data/pitched/optimize/QA/'.split(' '),
    archives=None,
    files=None,
    job_name='{{task.task_id}}_{{ds_nodash}}',
    cluster_name='optimize',
    dataproc_hadoop_properties=None,
    dataproc_hadoop_jars=None,
    gcp_conn_id='google_cloud_default',
    delegate_to=None,
    region='global',
    dag=dag
)
However, I now get the following error:
18/01/18 10:13:42 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.2-hadoop2
18/01/18 10:13:42 WARN s3native.S3xLoginHelper: The Filesystem URI contains login details. This is insecure and may be unsupported in future.
18/01/18 10:13:43 WARN s3a.S3AFileSystem: Client: Amazon S3 error 400: 400 Bad Request; Bad Request (retryable)
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696), S3 Extended Request ID: U6j5J9djR5UPPjhbjjLOtn7dG4IXDyMZfTD6CuFk5V6MXdUP65ArF56zP4Okx2NScxqYVh/UCTI=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:226)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:462)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
18/01/18 10:13:43 ERROR tools.DistCp: Invalid arguments:
org.apache.hadoop.fs.s3a.AWSS3IOException: doesBucketExist on s3: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696), S3 Extended Request ID: U6j5J9djR5UPPjhbjjLOtn7dG4IXDyMZfTD6CuFk5V6MXdUP65ArF56zP4Okx2NScxqYVh/UCTI=: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:282)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:226)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:462)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696), S3 Extended Request ID: U6j5J9djR5UPPjhbjjLOtn7dG4IXDyMZfTD6CuFk5V6MXdUP65ArF56zP4Okx2NScxqYVh/UCTI=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
... 18 more
Invalid arguments: doesBucketExist on s3: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696), S3 Extended Request ID: U6j5J9djR5UPPjhbjjLOtn7dG4IXDyMZfTD6CuFk5V6MXdUP65ArF56zP4Okx2NScxqYVh/UCTI=: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 8F6A80AA7432A696)
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-mapredSslConf <arg> Configuration for ssl config file, to use
with hftps://. Must be in the classpath.
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missingfiles or
directories
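One possible reading of the log: the message "doesBucketExist on s3" suggests the bucket name was resolved as "s3", which is what the extra "s3://" after the "@" in the target URI would produce. The sketch below illustrates this with Python's urlsplit (Hadoop's own URI parsing may differ in detail), then builds a target without the extra scheme. It also moves the credentials out of the URI into the standard S3A properties fs.s3a.access.key and fs.s3a.secret.key, since the log warns that login details in the URI are insecure. The helper names bucket_of and distcp_job_kwargs and the ACCESS_ID/SECRET placeholders are hypothetical, and whether dataproc_hadoop_properties reaches the DistCp job configuration on a given cluster is an assumption to verify.

```python
from urllib.parse import urlsplit

# Target in the same shape as the failing task (placeholder credentials).
BAD_TARGET = (
    "s3a://ACCESS_ID:SECRET@s3://umg-ers-analytics/"
    "qubole/user-data/pitched/optimize/QA/"
)


def bucket_of(uri):
    """Return the host part of the URI, which S3A uses as the bucket name."""
    return urlsplit(uri).hostname


def distcp_job_kwargs(source, bucket, prefix, access_key, secret_key):
    """Build DistCp arguments and Hadoop properties with no secrets in the URI.

    The returned keys match the operator parameters used in the question
    (arguments, dataproc_hadoop_properties); this helper itself is hypothetical.
    """
    target = "s3a://{}/{}".format(bucket, prefix.lstrip("/"))
    return {
        # A list avoids splitting a single string on spaces.
        "arguments": ["-update", source, target],
        # Standard S3A credential properties.
        "dataproc_hadoop_properties": {
            "fs.s3a.access.key": access_key,
            "fs.s3a.secret.key": secret_key,
        },
    }


if __name__ == "__main__":
    print(bucket_of(BAD_TARGET))  # s3
    job = distcp_job_kwargs(
        "gs://umg-comm-tech-dev/data/apollo/QA/",
        "umg-ers-analytics",
        "qubole/user-data/pitched/optimize/QA/",
        "ACCESS_ID",
        "SECRET",
    )
    print(bucket_of(job["arguments"][2]))  # umg-ers-analytics
```

The resulting dictionary could be unpacked into the operator call, for example DataProcHadoopOperator(..., **job), in place of the hand-written arguments string.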
gsutil -m rsync -r gs://GCS s3://S3
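If the gsutil route is used from a scheduled task, the command above can be assembled as an argument list rather than a shell string. This is only a sketch: it assumes gsutil is installed on the machine running the task and that AWS credentials are already configured for gsutil (for example in its boto configuration file). The function names gsutil_rsync_command and run_rsync are hypothetical.

```python
import subprocess


def gsutil_rsync_command(source, target, parallel=True):
    """Return the gsutil rsync invocation as an argument list."""
    cmd = ["gsutil"]
    if parallel:
        cmd.append("-m")  # run the copy operations in parallel
    cmd += ["rsync", "-r", source, target]  # -r recurses into sub-directories
    return cmd


def run_rsync(source, target):
    """Run the sync and raise CalledProcessError if gsutil exits non-zero."""
    subprocess.run(gsutil_rsync_command(source, target), check=True)


if __name__ == "__main__":
    # Prints the command without running it.
    print(" ".join(gsutil_rsync_command("gs://GCS", "s3://S3")))
```

Passing a list to subprocess.run avoids shell quoting issues if bucket paths ever contain spaces or special characters.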