AWS EMR throws an exception when configured with an S3 accelerate endpoint

This is the EMR step I am using:

s3-dist-cp --targetSize 1000 --outputCodec=gz --s3Endpoint=bucket.s3-accelerate.amazonaws.com --groupBy './(\d\d)/\d\d/' --src s3a://sourcebucket/ --dest s3a://destbucket/

It throws an exception with the accelerate endpoint (stack trace below).

EMR version:

Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Presto 0.194

What am I missing in the arguments passed to s3-dist-cp to overcome this error?

Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache.get(LocalCache.java:3937)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.provider.DefaultS3Provider.getS3(DefaultS3Provider.java:55)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.provider.DefaultS3Provider.getS3(DefaultS3Provider.java:22)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.getClient(GlobalS3Executor.java:122)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:89)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.doesBucketExist(AmazonS3LiteClient.java:88)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:138)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:116)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:448)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:109)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:869)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.lang.IllegalStateException: To enable accelerate mode, please use AmazonS3ClientBuilder.withAccelerateModeEnabled(true)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.setEndpoint(AmazonS3Client.java:670)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AmazonWebServiceClient.withEndpoint(AmazonWebServiceClient.java:897)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.provider.DefaultS3Provider$S3CacheLoader.load(DefaultS3Provider.java:62)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.provider.DefaultS3Provider$S3CacheLoader.load(DefaultS3Provider.java:58)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
    ... 30 more
Command exiting with ret '1'

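s3-dist-cp is built on the hadoop-aws library, which does not support accelerated buckets out of the box.

You will want to build your own jar that depends on hadoop-aws and aws-java-sdk-s3, translate the required parameters in it, and extend S3ClientFactory to enable accelerated uploads.

Example Maven dependencies: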

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
</dependency>
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-core</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>  
</dependency>
The last step is to give Hadoop the S3 client factory class:

<property>
  <name>fs.s3a.s3.client.factory.impl</name>
  <value>example_package.AcceleratedS3ClientFactory</value>
</property>
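Pass all required parameters, such as the source and destination S3 paths, in Args.

Note: do not specify the bucket-specific endpoint that hadoop-aws supports. hadoop-aws uses it in a way that is incompatible with acceleration, and you will get the same exception every time.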
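
As a rough illustration of the last two points, here is a minimal Python sketch that assembles a step definition in the JSON form that `aws emr add-steps --steps file://steps.json` accepts, and refuses any argument that contains an accelerate endpoint. The helper name `build_custom_jar_step` and the `--src`/`--dest` argument names are hypothetical (the real names depend on how your own jar parses its arguments), and the substring check only approximates the condition that triggers the exception.

```python
import json

# Assumption: any argument mentioning this host is an accelerate endpoint.
ACCELERATE_SUFFIX = "s3-accelerate.amazonaws.com"


def build_custom_jar_step(name, jar, args, action_on_failure="CONTINUE"):
    """Return one CUSTOM_JAR step definition as a plain dict."""
    for arg in args:
        # Acceleration should be enabled inside the jar's client factory,
        # not by handing an accelerate endpoint to the job.
        if ACCELERATE_SUFFIX in arg:
            raise ValueError(f"do not pass an accelerate endpoint: {arg}")
    return {
        "Type": "CUSTOM_JAR",
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "Jar": jar,
        "Args": list(args),
    }


step = build_custom_jar_step(
    name="a step name",
    jar="s3://app/my-s3distcp-1.0.jar",
    # Hypothetical argument names for the custom jar.
    args=["--src", "s3a://sourcebucket/", "--dest", "s3a://destbucket/"],
)

# Write this string to steps.json and pass it with --steps file://steps.json
steps_json = json.dumps([step], indent=2)
print(steps_json)
```

Keeping the arguments in a JSON file avoids the shell-quoting issues of the inline shorthand syntax when a path or pattern contains commas or quotes.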


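This is really nice. Consider adding it as a patch/JIRA on the Hadoop issue tracker. But since EMR only uses the (closed-source) AWS S3 connector, I don't know whether it would pick it up.

@SteveLoughran Here we are building a new jar application for EMR. EMR just launches it on the master node with "java -jar ..". If it packages its own hadoop-aws, S3 SDK libs, and S3 connector, it will use those instead of the connector on the node, so the jar has full control over how it does the job. AmazonS3Client is also pure Java, with no hidden code :) Or am I missing a point?

I was thinking that if you have an extended client for the s3a connector, we could actually ship it in the hadoop-aws JAR. Right now it bundles the "normal" one and the "inconsistent" one, which forces listing inconsistency to keep the code honest.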

Example AWS CLI command for submitting the custom jar as a step:

aws emr add-steps --cluster-id cluster_id \
--steps Type=CUSTOM_JAR,Name="a step name",Jar=s3://app/my-s3distcp-1.0.jar,\
Args=["key","value"]