Python output file is not being saved to my bucket on AWS S3


I am trying to follow this tutorial from AWS. I am at the Quick Example step.

When I try to run this command:

aws emr add-steps --cluster-id j-xxxxx --steps Type=Spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE

my output file never appears in my bucket, even though the EMR console reports that the step completed:

SparkWordCountApp   Completed   2017-01-24 16:35 (UTC+1)    10 seconds
Here is the word count Python file:

from __future__ import print_function
from pyspark import SparkContext
import sys

if __name__ == "__main__":
    if len(sys.argv) != 3:
        # Expect an input path and an output path in addition to the script name.
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="WordCount")
    # Split each line of the input into words and count the occurrences of each word.
    text_file = sc.textFile(sys.argv[1])
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    # Writes one part-* file per partition under the output path.
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
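
Before resubmitting to EMR, it can help to sanity-check the script locally. A minimal sketch, assuming a local Spark installation; the file paths are placeholders, and the output directory must not already exist:

spark-submit --master "local[*]" wordcount.py input.txt output/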
Here is the log file from the cluster:

17/01/25 14:40:19 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/01/25 14:40:19 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (20480+2048 MB) is above the max threshold (11520 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
    at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:304)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:164)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'

I am using m3.xlarge instances.
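
For what it's worth, the numbers in the exception are consistent with that instance type (a rough reconstruction; the ~10% overhead default is an assumption about this Spark version):

    requested executor memory:  20480 MB  (--executor-memory 20g)
  + executor memory overhead:    2048 MB  (default: max(384 MB, 10% of executor memory))
  = total per container:        22528 MB  > the 11520 MB that YARN allows per m3.xlarge node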

Try setting the output directory to a subdirectory rather than the root of the bucket. I can't speak for the EMR S3 client, but I know the Hadoop S3A one has had some rename()-related problems in the past when the destination was the root of a bucket. Otherwise, turn the logging up and see what the com.aws modules print.

Comments on the answer:

- I have added the log file to my question.
- What is the value of spark.executor.memory?
- From the command line, it is 20g.
- Yes, you had already mentioned it and I missed it. Each m3.xlarge instance has only 15g, but the executors request 20g + 2g, and the YARN configuration only allows a maximum of 11.5g. Can you reduce it to 8g and try running it again?
- @franklinsijo, I have tried that. The Python file executes fine, but I still get no output file.
- Has outputbucket already been created? Your input.txt is not empty, right?
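
Putting the thread's two suggestions together (8g executors to fit under the YARN limit, and a non-root output directory), a resubmitted step might look like the sketch below. The wordcount-output subdirectory name is hypothetical; the cluster id and bucket names are carried over from the question:

aws emr add-steps --cluster-id j-xxxxx --steps Type=Spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,8g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/wordcount-output/],ActionOnFailure=CONTINUE

Note that saveAsTextFile will also fail if the target directory already exists, so each run needs a fresh output path.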