Amazon Web Services: how to set an AWS proxy host in the Spark configuration
I want to know how to set an AWS proxy host and region on a Spark session or Spark context. I was able to set them in AWS Java SDK code, and that works fine:
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setProxyHost("aws-proxy-qa.xxxxx.organization.com");
clientConfig.setProxyPort(8099);
AmazonS3ClientBuilder.standard()
    .withRegion(getAWSRegion(Regions.US_WEST_2))
    .withClientConfiguration(clientConfig) // setting the AWS proxy host
    .build();
Can someone help me set the same things (region and proxy) on the Spark context? The S3 file I am reading is in a different region than the EMR cluster.

The region is determined automatically based on fs.s3a.access.key and fs.s3a.secret.key. Like the other S3 properties, set these on the SparkConf:
/**
 * example getSparkSessionForS3
 * @return
 */
def getSparkSessionForS3(): SparkSession = {
  val conf = new SparkConf()
    .setAppName("testS3File")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.hadoop.fs.s3a.endpoint", "yourendpoint")
    .set("spark.hadoop.fs.s3a.connection.maximum", "200")
    .set("spark.hadoop.fs.s3a.fast.upload", "true")
    .set("spark.hadoop.fs.s3a.connection.establish.timeout", "500")
    .set("spark.hadoop.fs.s3a.connection.timeout", "5000")
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .set("spark.hadoop.com.amazonaws.services.s3.enableV4", "true")
    .set("spark.hadoop.com.amazonaws.services.s3.enforceV4", "true")
    .set("spark.hadoop.fs.s3a.proxy.host", "yourhost")
  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()
  spark
}
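For completeness, a minimal usage sketch assuming the session above; the bucket and path are hypothetical placeholders, and fs.s3a.proxy.port (a standard s3a property) is assumed to match the 8099 port from the question:

// Usage sketch: read an S3 object through the proxy configured above.
// "s3a://your-bucket/path/to/file.txt" is a placeholder path.
val spark = getSparkSessionForS3()
// The proxy port can also be set on the live Hadoop configuration.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", "8099")
val lines = spark.read.textFile("s3a://your-bucket/path/to/file.txt")
lines.show(5)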
Looks good, but (a) you don't need the fs.s3a.impl one, and (b) I don't think those com.amazonaws options are picked up by the s3a client. The Hadoop s3a documentation covers the switch to V4 signing (which will soon become mandatory).
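Following that comment, a hedged sketch of the s3a-native equivalents: the proxy is configured through the documented fs.s3a.proxy.* properties, and V4 signing is selected via fs.s3a.signing-algorithm together with a region-specific endpoint. The endpoint, proxy host, and port values below are placeholders, and AWSS3V4SignerType is the signer name the AWS SDK registers for V4:

import org.apache.spark.SparkConf

// Sketch of the s3a-native settings suggested by the comment above
// (assumes the Hadoop 2.8+ s3a connector; all values are placeholders).
val conf = new SparkConf()
  .setAppName("testS3File")
  // V4-signed requests need the region-specific endpoint, not the global one.
  .set("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
  .set("spark.hadoop.fs.s3a.signing-algorithm", "AWSS3V4SignerType")
  // Proxy settings read by the s3a connector itself, replacing the
  // com.amazonaws.* options that it does not pick up.
  .set("spark.hadoop.fs.s3a.proxy.host", "aws-proxy-qa.xxxxx.organization.com")
  .set("spark.hadoop.fs.s3a.proxy.port", "8099")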