Apache Spark: resetting s3a back to s3
I set up s3a to switch roles in AWS emr-6.2.0 using the following:
sparky.sparkContext._jsc.hadoopConfiguration().set(
"fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
)
sparky.sparkContext._jsc.hadoopConfiguration().set(
"fs.s3a.access.key", new_credentials["Credentials"]["AccessKeyId"]
)
sparky.sparkContext._jsc.hadoopConfiguration().set(
"fs.s3a.secret.key", new_credentials["Credentials"]["SecretAccessKey"]
)
sparky.sparkContext._jsc.hadoopConfiguration().set(
"fs.s3a.session.token", new_credentials["Credentials"]["SessionToken"]
)
The question is: how do I switch back to the original role?
The simple solution would seem to be:
spark.sparkContext._jsc.hadoopConfiguration().clear()
But this clears everything, and I then get the following error:
>>> df_disp_prod = spark.read.csv(
... "s3://sandboxes-analysis/demo_inventory/distinct_disp_prod_id.tsv",
... sep=r"\t",
... header=True,
... )
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 535, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o515.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3336)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3356)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
Reading "s3a://my_sweet_home/prodigal_son_returns" still seems to work, though.
You should be able to use s3:// URLs with the EMR S3 connector and s3a:// with the Apache Hadoop connector; just set the s3a auth details and the two should coexist. Clearing the configuration loses too much information. If you really need to remove a single option, use
Configuration.unset(key)
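Building on `unset`, one way to switch back cleanly (a sketch, not from the original post) is to snapshot the four s3a keys before overriding them, then restore the snapshot afterwards. `StubConf` below is a dict-backed stand-in for the object returned by `spark.sparkContext._jsc.hadoopConfiguration()`, so the example runs without a cluster; the real Hadoop `Configuration` exposes the same `get`/`set`/`unset` methods.

```python
# The four keys the role switch overrides.
S3A_KEYS = [
    "fs.s3a.aws.credentials.provider",
    "fs.s3a.access.key",
    "fs.s3a.secret.key",
    "fs.s3a.session.token",
]

class StubConf:
    """Dict-backed stand-in for Hadoop's Configuration (get/set/unset)."""
    def __init__(self):
        self._d = {}

    def get(self, key):
        return self._d.get(key)

    def set(self, key, value):
        self._d[key] = value

    def unset(self, key):
        self._d.pop(key, None)

def snapshot(conf, keys):
    # Record the current value of each key (None if the key is unset).
    return {k: conf.get(k) for k in keys}

def restore(conf, saved):
    # Put every key back to its saved value; unset keys that had none.
    for key, value in saved.items():
        if value is None:
            conf.unset(key)
        else:
            conf.set(key, value)

conf = StubConf()
saved = snapshot(conf, S3A_KEYS)            # before the role switch
conf.set("fs.s3a.access.key", "ASIA-TEMP")  # simulate the override
restore(conf, saved)                        # switch back
print(conf.get("fs.s3a.access.key"))        # -> None
```

Only the keys you touched are changed, so the filesystem mappings and everything else loaded from the site files stay intact.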
configs = [
"fs.s3a.aws.credentials.provider",
"fs.s3a.access.key",
"fs.s3a.secret.key",
"fs.s3a.session.token",
]
for c in configs:
    print(c, spark.sparkContext._jsc.hadoopConfiguration().get(c))
spark.sparkContext._jsc.hadoopConfiguration().clear()
spark.sparkContext._jsc.hadoopConfiguration().reloadConfiguration()
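The `clear()` call above is exactly what triggers the `No FileSystem for scheme "s3"` error: the Hadoop configuration carries more than the credential keys you set, including the scheme-to-FileSystem mappings loaded from the cluster's site files, and `clear()` drops those too. A self-contained sketch with a plain dict (the `com.example.EmrFileSystem` class name is an illustrative stand-in, not the real EMRFS class):

```python
# Configuration modeled as a dict; only the lookup logic matters here.
site_config = {
    "fs.s3.impl": "com.example.EmrFileSystem",  # registers the s3:// scheme
    "fs.s3a.access.key": "ASIA-TEMP",           # temporary credentials we set
    "fs.s3a.secret.key": "...",
    "fs.s3a.session.token": "...",
}

def filesystem_for(conf, scheme):
    # Mirrors FileSystem.getFileSystemClass: look up fs.<scheme>.impl.
    impl = conf.get(f"fs.{scheme}.impl")
    if impl is None:
        raise RuntimeError(f'No FileSystem for scheme "{scheme}"')
    return impl

# clear() wipes everything, including the s3:// mapping:
cleared = {}
try:
    filesystem_for(cleared, "s3")
except RuntimeError as e:
    print(e)  # -> No FileSystem for scheme "s3"

# unset() removes only the chosen keys, so s3:// keeps working:
for key in ("fs.s3a.access.key", "fs.s3a.secret.key", "fs.s3a.session.token"):
    site_config.pop(key, None)  # stands in for Configuration.unset(key)
print(filesystem_for(site_config, "s3"))  # -> com.example.EmrFileSystem
```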
To get a dictionary of all Hadoop configurations:
def hadoop_config_dict(spark):
    """Return every Hadoop configuration entry as a dict, sorted by key."""
    hadoop_config_d = {
        e.getKey(): e.getValue()
        for e in spark.sparkContext._jsc.hadoopConfiguration().iterator()
    }
    # Sort by key and return
    return {k: hadoop_config_d[k] for k in sorted(hadoop_config_d)}