gzFiles: reading csv.gz files from an S3 bucket in Spark

Tags: amazon-web-services, apache-spark, amazon-s3, apache-spark-sql, amazon-emr

I am trying to read Part-xxxx.csv.gz files from an S3 bucket, and I am able to read them and write the output back to an S3 bucket when I run the program from IntelliJ.

When I run the same program on EMR (as a jar file), I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Application application_1543327349114_0001 finished with failed status
It seems the gz files cannot be read on EMR. However, if the input file is a plain csv, the data is read without any problem.

My code:

val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3a://test-system/Samplefile.csv")
df.createOrReplaceTempView("data")
val res = spark.sql("select count(*), id, geo_id from data group by id, geo_id")
res.coalesce(1).write.format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("s3a://test-system/Output/Sampleoutput")
I am using Spark 2.3.0 and Hadoop 2.7.3.

Regarding this issue, please help me with how to read *.csv.gz files in EMR.
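For reference, Spark's CSV reader decompresses gzip input transparently based on the file extension, so the same read call used for plain csv should also accept .csv.gz paths. A minimal sketch, assuming the bucket name and file glob from the question (it requires a running Spark environment and S3 credentials, so it is not runnable standalone):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-gz-sketch").getOrCreate()

// Gzip is detected from the .gz extension; no extra option is needed.
// Note, however, that gzip is not splittable: each .gz file is
// decompressed by a single task, so one large file means one partition.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3a://test-system/Part-*.csv.gz")
```

One detail worth checking on EMR: the cluster's native S3 filesystem is EMRFS under the s3:// scheme, while s3a:// goes through the Hadoop S3A client, which may not be configured the same way on the cluster as it is in a local IntelliJ run.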

Standard log:

18/11/28 07:41:22 INFO RMProxy: Connecting to ResourceManager at ip-172-30-3-95.ap-northeast-1.compute.internal/172.30.3.95:8032
18/11/28 07:41:23 INFO Client: Requesting a new application from cluster with 2 NodeManagers
18/11/28 07:41:23 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (106496 MB per container)
18/11/28 07:41:23 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/11/28 07:41:23 INFO Client: Setting up container launch context for our AM
18/11/28 07:41:23 INFO Client: Setting up the launch environment for our AM container
18/11/28 07:41:23 INFO Client: Preparing resources for our AM container
18/11/28 07:41:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/11/28 07:41:29 INFO Client: Uploading resource file:/mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67/__spark_libs__1058363571489040863.zip -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/__spark_libs__1058363571489040863.zip
18/11/28 07:41:33 INFO Client: Uploading resource s3://test-system/SparkApps/jar/rxsicheck.jar -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/rxsicheck.jar
18/11/28 07:41:33 INFO S3NativeFileSystem: Opening 's3://test-system/SparkApps/jar/rxsicheck.jar' for reading
18/11/28 07:41:33 INFO Client: Uploading resource file:/mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67/__spark_conf__1080415411630926230.zip -> hdfs://ip-172-30-3-95.ap-northeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1543390729790_0001/__spark_conf__.zip
18/11/28 07:41:33 INFO SecurityManager: Changing view acls to: hadoop
18/11/28 07:41:33 INFO SecurityManager: Changing modify acls to: hadoop
18/11/28 07:41:33 INFO SecurityManager: Changing view acls groups to: 
18/11/28 07:41:33 INFO SecurityManager: Changing modify acls groups to: 
18/11/28 07:41:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/11/28 07:41:33 INFO Client: Submitting application application_1543390729790_0001 to ResourceManager
18/11/28 07:41:33 INFO YarnClientImpl: Submitted application application_1543390729790_0001
18/11/28 07:41:34 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
18/11/28 07:41:34 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1543390893662
     final status: UNDEFINED
     tracking URL: http://ip-172-30-3-95.ap-northeast-1.compute.internal:20888/proxy/application_1543390729790_0001/
     user: hadoop
18/11/28 07:41:35 INFO Client: Application report for application_1543390729790_0001 (state: ACCEPTED)
(Application report repeated with state: ACCEPTED once per second from 07:41:36 to 07:42:06)
18/11/28 07:42:07 INFO Client: Application report for application_1543390729790_0001 (state: FAILED)
18/11/28 07:42:07 INFO Client: 
     client token: N/A
     diagnostics: Application application_1543390729790_0001 failed 2 times due to AM Container for appattempt_1543390729790_0001_000002 exited with  exitCode: 15
For more detailed output, check application tracking page:http://ip-172-30-3-95.ap-northeast-1.compute.internal:8088/cluster/app/application_1543390729790_0001Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1543390729790_0001_02_000001
Exit code: 15
Stack trace: ExitCodeException exitCode=15: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 15
Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1543390893662
     final status: FAILED
     tracking URL: http://ip-172-30-3-95.ap-northeast-1.compute.internal:8088/cluster/app/application_1543390729790_0001
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1543390729790_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/28 07:42:07 INFO ShutdownHookManager: Shutdown hook called
18/11/28 07:42:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d10f886a-bf7b-4a0a-a91f-2f0353bb7b67
Command exiting with ret '1'

Comments:

We need the exception details... (the one-liner doesn't help!) and a sample of the data you are trying to process. There are at least two possible points of failure here: a) the compressed input data, and b) the coalesce(1) on the output. Both issues have been discussed many times on Stack Overflow, so I would suggest searching the existing answers first - there really isn't much to say here that hasn't already been said many times.

Thank you for your reply. I will post my full log here.
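On the second point raised above, the coalesce(1) in the question funnels the whole dataset through a single task before writing, which can overwhelm one executor. A hedged sketch of one alternative, reusing the bucket and column names from the question as placeholders (it requires a Spark cluster and S3 access, so it is not runnable standalone):

```scala
import org.apache.spark.sql.SparkSession

object WriteWithoutCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-sketch").getOrCreate()

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://test-system/Part-*.csv.gz")

    df.createOrReplaceTempView("data")
    val res = spark.sql("select count(*), id, geo_id from data group by id, geo_id")

    // Write with several partitions instead of coalesce(1); if a single
    // output file is required, repartition(1) at least forces a shuffle
    // so upstream stages still run with full parallelism.
    res.repartition(8)
      .write
      .format("csv")
      .option("header", "true")
      .mode("overwrite")
      .save("s3a://test-system/Output/Sampleoutput")

    spark.stop()
  }
}
```

The aggregated result here is likely small, so the write itself is cheap; the important part is that the shuffle introduced by repartition keeps the read and aggregation stages parallel.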