Apache Spark: PySpark save to S3
I am trying to save a large DataFrame to an Amazon S3 bucket. The following code works perfectly:
sqlContext.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"]) \
    .select('A').repartition(1).write \
    .format("text") \
    .mode("overwrite") \
    .option("header", "false") \
    .option("codec", "gzip") \
    .save("s3n://BUCKETNAME/temp.txt")
However, saving the full DataFrame fails, and I get the following error in my notebook:
Py4JJavaError: An error occurred while calling o1274.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Date
at org.jets3t.service.model.StorageObject.getLastModifiedDate(StorageObject.java:376)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:176)
In the Spark application UI, the job is reported as successful.
I have the following configuration set:
sc._jsc.hadoopConfiguration().set("fs.s3n.multipart.uploads.enabled", "true")
To debug, I tried the following, which works correctly:
sqlContext.createDataFrame(full_df.select('columnA').take(5), ['columnA']) \
    .select('columnA').repartition(1).write \
    .format("text") \
    .mode("overwrite") \
    .option("header", "false") \
    .option("codec", "gzip") \
    .save("s3n://BUCKETNAME/temp.txt")
I found the link below, which seems to be about this issue, but I could not find a working package.
Can anyone help solve this mysterious error?

Switch to s3a instead of s3n, with the Hadoop 2.7 JARs. s3n's time has passed; all it gets now are regression fixes.

That gives me a
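A minimal sketch of what the s3a switch could look like, assuming the Hadoop 2.7 `hadoop-aws` JAR and a matching AWS SDK JAR are on the classpath. `BUCKETNAME`, the credential values, and `df` are placeholders; this is a configuration fragment, not a tested recipe:

```python
# Configure the s3a filesystem client instead of the deprecated s3n one.
# fs.s3a.* are the standard Hadoop s3a options; values here are placeholders.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df.repartition(1).write \
    .format("text") \
    .mode("overwrite") \
    .option("codec", "gzip") \
    .save("s3a://BUCKETNAME/temp.txt")  # s3a:// scheme instead of s3n://
```

The only change on the write path itself is the URL scheme; the rest of the work is making sure the s3a client classes and credentials are available to the JVM.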
Py4JJavaError: An error occurred while calling o631.save: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V
This seems to be caused by the Hadoop version. In my spark-defaults.conf I have com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2, which looks consistent, but the installed Hadoop is version 2.0.0-cdh4.7.1, and that cannot be changed...
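For reference, the dependency line in question can be expressed via Spark's standard `spark.jars.packages` property; the key constraint is that the `hadoop-aws` version must match the Hadoop version actually installed on the cluster, which is exactly what breaks on a CDH4 (Hadoop 2.0) install. A hypothetical spark-defaults.conf fragment:

```
# spark-defaults.conf (fragment)
# hadoop-aws MUST match the cluster's Hadoop version; 2.7.2 will not
# link against a Hadoop 2.0.0-cdh4 runtime, hence the NoSuchMethodError.
spark.jars.packages  com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
```

With a CDH4-era Hadoop that cannot be upgraded, there is no hadoop-aws release that provides the s3a classes, so the realistic options are upgrading the cluster's Hadoop or staying on the older S3 connectors.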