Java: why, when running Spark in local mode, do I need to perform a read through the DataFrames API in order to authenticate with AWS?

This code works and passes:
public class Test {
    public static void main(String[] args) throws IOException {
        AWSCredentials h = new AWSCredentials();
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("Test")
                .set("fs.s3a.access.key", h.access_key_id)
                .set("fs.s3a.secret.key", h.secret_access_key);
        if (h.session_token != null) {
            conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
            conf.set("fs.s3a.session.token", h.session_token);
        }
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        long count = spark.read().text("s3a://mybucket/path-to-files/file+9+0000000223.bin").javaRDD().count();
        System.out.println("count from scala spark is: " + count);
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<String> maxwellRdd = sc.textFile("s3a://mybucket/path-to-files/*");
        System.out.println("count is: " + maxwellRdd.count());
        sc.stop();
    }
}
I don't believe your first example actually works — or, more precisely, if it does work, it's because something picked up your credentials from environment variables or from the EC2 IAM settings. If you are trying to set s3a options in the Spark conf, you need to prefix each one with "spark.hadoop.".

Simple test: after creating the Spark context, call sc.hadoopConfiguration and look for the options there (they are all defined in org.apache.hadoop.fs.s3a.Constants, if you want to be 100% sure you haven't made any typos).

Comment (asker): The first one did work, in the sense that it read the file — presumably via ~/.aws/credentials, which seems odd to me (there is no default profile). Anyway, you're completely right that the options need to be prefixed with spark.hadoop. That solved the problem. Cheers.

Comment (answerer): I doubt it came from ~/.aws/credentials, but if you set the AWS_ environment variables, spark-submit picks them up automatically and converts them into fs.s3n/s3a properties. Glad to see it's all working.
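Why the prefix matters can be sketched in plain Java: Spark copies only conf entries that start with "spark.hadoop." into the Hadoop Configuration that the s3a filesystem actually reads, stripping that prefix as it goes. The class below is an illustrative simulation of that behaviour (the real logic lives inside Spark; names here are made up for the demo):

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixDemo {
    // Simplified model: keep only "spark.hadoop."-prefixed entries,
    // stripping the prefix, the way Spark hands properties to Hadoop.
    static Map<String, String> toHadoopConf(Map<String, String> sparkConf) {
        Map<String, String> hadoopConf = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith("spark.hadoop.")) {
                hadoopConf.put(e.getKey().substring("spark.hadoop.".length()),
                               e.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        Map<String, String> sparkConf = new HashMap<>();
        sparkConf.put("fs.s3a.access.key", "AKIA-EXAMPLE");          // no prefix: dropped
        sparkConf.put("spark.hadoop.fs.s3a.secret.key", "secret");   // prefixed: survives
        Map<String, String> hadoopConf = toHadoopConf(sparkConf);
        System.out.println(hadoopConf.containsKey("fs.s3a.access.key")); // false
        System.out.println(hadoopConf.get("fs.s3a.secret.key"));         // secret
    }
}
```

Under this model, an unprefixed "fs.s3a.access.key" in the SparkConf never reaches the Hadoop side, which is exactly why the s3a connector falls back to environment variables or instance credentials.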
public class Test {
    public static void main(String[] args) throws IOException {
        AWSCredentials h = new AWSCredentials();
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("Test")
                .set("fs.s3a.access.key", h.access_key_id)
                .set("fs.s3a.secret.key", h.secret_access_key);
        if (h.session_token != null) {
            conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
            conf.set("fs.s3a.session.token", h.session_token);
        }
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        //long count = spark.read().text("s3a://mybucket/path-to-files/file+9+0000000223.bin").javaRDD().count();
        //System.out.println("count from scala spark is: " + count);
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<String> maxwellRdd = sc.textFile("s3a://mybucket/path-to-files/*");
        System.out.println("count is: " + maxwellRdd.count());
        sc.stop();
    }
}
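With the fix from the answer applied, the same credentials can also be supplied as prefixed properties, e.g. in a spark-defaults.conf file (a sketch — the key names match the code above, the values are placeholders):

spark.hadoop.fs.s3a.access.key               <your-access-key>
spark.hadoop.fs.s3a.secret.key               <your-secret-key>
spark.hadoop.fs.s3a.session.token            <your-session-token>
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider

Equivalently, each conf.set("fs.s3a.…", …) call in the code would become conf.set("spark.hadoop.fs.s3a.…", …).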
dependencies {
    compile group: 'org.ini4j', name: 'ini4j', version: '0.5.4'
    compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.8'
    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.2.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '2.8.3'
    //compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '1.11.313'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}