Apache Spark: writing to S3 from Spark without access and secret keys

Tags: apache-spark, amazon-s3, amazon-ec2, permissions

Our EC2 servers are configured to allow access to my-bucket when using the DefaultAWSCredentialsProviderChain, so the following code using the plain AWS SDK works fine:

AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", "/path/to/my-file.txt"));
Spark's S3AOutputStream uses the same SDK internally, but trying to upload a file without providing access and secret keys does not work:

sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");
This gives:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=  
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)  
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)  
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)  
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)  
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
    <truncated>

Is there any way to force Spark to use the default credentials provider chain instead of relying on access and secret keys?

Technically, it is Hadoop's S3A output stream. Look at the stack trace to see who to file bug reports against :)

S3A does support instance credentials from Hadoop 2.7 onwards.
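
As a rough sketch of what that looks like: leave the access and secret keys unset and let S3A fall back to the EC2 instance profile. If your Hadoop version supports the fs.s3a.aws.credentials.provider property (added around Hadoop 2.8), you can also pin the provider explicitly; the property and the InstanceProfileCredentialsProvider class name below are assumptions to verify against your Hadoop and AWS SDK versions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3AInstanceCredentialsExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("s3a-instance-credentials"));

        // Do NOT set fs.s3a.access.key / fs.s3a.secret.key; let S3A fall back
        // to the EC2 instance profile (supported from Hadoop 2.7 onwards).
        sc.hadoopConfiguration().set("fs.s3a.impl",
                "org.apache.hadoop.fs.s3a.S3AFileSystem");

        // On Hadoop 2.8+ the provider can be pinned explicitly (assumption:
        // verify property and class name against your Hadoop/AWS SDK versions).
        sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.InstanceProfileCredentialsProvider");

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
        rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");

        sc.stop();
    }
}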

If you cannot connect, you need the Hadoop 2.7 JARs on your classpath, together with the exact AWS SDK version they were built against (1.7.4, as I recall).

Spark has one small feature: if you submit work with the AWS_* env vars set, it picks them up, copies them in as the fs.s3a keys, and so propagates them through your system.
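
That propagation essentially amounts to copying the standard AWS environment variables into the fs.s3a.* keys. A minimal sketch of the equivalent manual step (an illustration, not a reproduction of Spark's internal logic) could look like this:

import org.apache.spark.api.java.JavaSparkContext;

public class PropagateAwsEnvVars {
    // Copy the standard AWS_* environment variables into the S3A keys,
    // roughly mirroring what spark-submit does for you when they are set.
    public static void copyAwsEnvToS3a(JavaSparkContext sc) {
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
        if (accessKey != null && secretKey != null) {
            sc.hadoopConfiguration().set("fs.s3a.access.key", accessKey);
            sc.hadoopConfiguration().set("fs.s3a.secret.key", secretKey);
        }
    }
}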

Yes, see the solution here.