Apache Spark: Spark Parquet S3 error: AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxx, AWS Error Code: null

Tags: apache-spark, amazon-s3, parquet, spark-cassandra-connector

I am trying to read Parquet files stored in AWS S3 and I am getting the following error:

17/12/19 11:27:40 DEBUG DAGScheduler: ShuffleMapTask finished on 0
17/12/19 11:27:40 DEBUG DAGScheduler: submitStage(ResultStage 2)
17/12/19 11:27:40 DEBUG DAGScheduler: missing: List(ShuffleMapStage 1)
17/12/19 11:27:40 DEBUG DAGScheduler: submitStage(ShuffleMapStage 1)
17/12/19 11:27:40 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_1, runningTasks: 2
17/12/19 11:27:40 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 4, ip-xxx-xxx-xxx-xxx.ec2.internal): com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxxxxxxx/xxxxxxxxx=
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:688)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:71)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
    at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
    at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

17/12/19 11:27:40 DEBUG DAGScheduler: submitStage(ResultStage 2)
I don't understand why this error occurs. Here is my code:

val conf = new SparkConf(true).setAppName("TestRead")
conf.set(SPARK_CASS_CONN_HOST, config.cassHost)
conf.set("spark.cassandra.connection.timeout_ms","10000")

val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", config.s3AccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", config.s3SecretKey)

val sqlContext = new SQLContext(sc)

val data = sqlContext.read.parquet(s"s3a://${config.s3bucket}/${config.s3folder}")

println(s"data size is ${data.count()}")
data.show()
if(config.cassWriteToTable){
      println("Writing to Cassandra Table")
      data.write.mode(SaveMode.Append).format("org.apache.spark.sql.cassandra").options(Map("table" -> config.cassTable, "keyspace" -> config.cassKeyspace)).save()
}

println("Stopping TestRead...")
sc.stop()
I have included the following dependencies in my build.sbt file:

"org.apache.spark" % "spark-core_2.10" % "1.6.1" % "provided,test" ,
  "org.apache.spark" % "spark-sql_2.10" % "1.6.1" % "provided",
  "com.typesafe.play" % "play-json_2.10" % "2.4.6" excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.core")),
  "mysql" % "mysql-connector-java" % "5.1.39",
  "com.amazonaws" % "aws-java-sdk-pom" % "1.11.7" exclude("commons-beanutils","commons-beanutils") exclude("commons-collections","commons-collections") excludeAll ExclusionRule(organization = "javax.servlet")

What could be the problem here?

403/Forbidden: the credentials you are using do not have access to the file you are trying to read.
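A quick way to confirm this outside of Spark is to call getObjectMetadata, the same S3 operation that fails inside S3AFileSystem.getFileStatus in your stack trace, with the same keys. A minimal sketch, assuming the aws-java-sdk 1.x client already on your classpath; the bucket and key below are placeholders, so substitute the exact values your job uses:

import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client

object S3AccessCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder credentials: use exactly the same access/secret keys the Spark job uses.
    val s3 = new AmazonS3Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))

    // getObjectMetadata is the call that raised the 403 in the stack trace; if it
    // fails here too, the credentials or bucket policy lack read access to the key.
    val meta = s3.getObjectMetadata("my-bucket", "my-folder/part-00000.parquet")
    println(s"Readable, content length = ${meta.getContentLength}")
  }
}

If this fails with the same 403, check the IAM policy attached to those keys (reads via S3A need at least s3:GetObject on the objects and s3:ListBucket on the bucket) as well as any bucket policy.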

Regarding the NPE, file a bug report against Spark on issues.apache.org; they will work out who is to blame.


Before you do that: make sure you are on the latest Spark release and search for that NPE. There is no need to file a duplicate, especially if it has already been fixed.

I ran into a similar issue and solved it by upgrading the Hadoop libraries from 2.7.x to 2.8.5.
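In an sbt build like the one in the question, that upgrade amounts to bringing in the newer hadoop-aws module. A rough sketch; whether these versions clash with the Hadoop jars bundled in your Spark distribution is something to verify on your cluster:

  "org.apache.hadoop" % "hadoop-aws"    % "2.8.5",
  "org.apache.hadoop" % "hadoop-common" % "2.8.5" % "provided",

hadoop-aws declares its own matching AWS SDK dependency, so you may be able to drop the explicit aws-java-sdk-pom entry.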

From the documentation ():

S3A improvements: add the ability to plug in any AWSCredentialsProvider, support reading s3a credentials from the Hadoop credential provider API in addition to XML configuration files, and support Amazon STS temporary credentials.

You may also want to use "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" as the credentials provider and specify a session token in addition to the access and secret keys.
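A minimal sketch of that configuration, reusing the sc from the question's code; the property names are the standard S3A ones in Hadoop 2.8+, and the values are placeholders for STS-issued temporary credentials:

sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "TEMPORARY_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "TEMPORARY_SECRET_KEY")
// The session token issued along with the temporary access/secret keys:
sc.hadoopConfiguration.set("fs.s3a.session.token", "SESSION_TOKEN")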

"org.apache.spark" % "spark-core_2.10" % "1.6.1" % "provided,test" ,
  "org.apache.spark" % "spark-sql_2.10" % "1.6.1" % "provided",
  "com.typesafe.play" % "play-json_2.10" % "2.4.6" excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.core")),
  "mysql" % "mysql-connector-java" % "5.1.39",
  "com.amazonaws" % "aws-java-sdk-pom" % "1.11.7" exclude("commons-beanutils","commons-beanutils") exclude("commons-collections","commons-collections") excludeAll ExclusionRule(organization = "javax.servlet")