Trying to pass an S3 bucket name and folder path as command-line arguments to a Python script on EMR via the AWS CLI

I am trying to pass a bucket name and folder names to a Python script via the AWS CLI, like this:

aws emr add-steps --cluster-id j-12XXXXXXXXX2R --steps Type=spark,Name=step0_do_something,Args=[--deploy-mode,cluster,--conf,spark.yarn.appExecutorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.executorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.submit.waitAppCompletion=true,--conf,spark.master=yarn,--py-files,s3://com.some.bucketname/scripts/my_modules.egg,s3://com.some.bucketname/scripts/my_steps/step0_do_something.py com.other.bucketname an_input_filename.csv somefoldername/somesubfoldername],ActionOnFailure=CANCEL_AND_WAIT

My approach is to capture the three strings 'com.other.bucketname', 'an_input_filename.csv' and 'somefoldername/somesubfoldername' in the script step0_do_something.py as follows:

import sys

if __name__ == '__main__':
    session = start_spark_session()  # helper shipped in the my_modules package
    print(sys.argv[0])
    print(sys.argv[1])
    print(sys.argv[2])
However, all I get is the following error message:

Error parsing parameter '--steps': Expected: ',', received: 'EOF' for input:
Type=spark,Name=step0_do_something,Args=[--deploy-mode,cluster,--conf,spark.yarn.appExecutorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.executorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.submit.waitAppCompletion=true,--conf,spark.master=yarn,--py-files,s3://com.some.bucketname/scripts/my_modules.egg,s3://com.some.bucketname/scripts/my_steps/step0_do_something.py
                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
It may be worth mentioning that the same aws emr add-steps command without the three trailing arguments 'com.other.bucketname', 'an_input_filename.csv' and 'somefoldername/somesubfoldername' is what I have been using to submit steps to the EMR cluster without any problem, albeit with hard-coded bucket and folder names. So my failed attempt at appending the arguments this way must be the cause of the error: the shell splits the command on whitespace, so the value of --steps ends at the first space after the .py path, and the shorthand parser then hits end-of-input while still expecting the rest of the Args list. I just couldn't find any description of how to do this properly on the AWS documentation pages or anywhere else.
Any help is much appreciated!

Simply replacing the spaces between the script path and the three strings with commas does the job:

aws emr add-steps --cluster-id j-12XXXXXXXXX2R --steps Type=spark,Name=step0_do_something,Args=[--deploy-mode,cluster,--conf,spark.yarn.appExecutorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.executorEnv.PYTHON_EGG_CACHE=/tmp,--conf,spark.yarn.submit.waitAppCompletion=true,--conf,spark.master=yarn,--py-files,s3://com.some.bucketname/scripts/my_modules.egg,s3://com.some.bucketname/scripts/my_steps/step0_do_something.py,com.other.bucketname,an_input_filename.csv,somefoldername/somesubfoldername],ActionOnFailure=CANCEL_AND_WAIT
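If the comma-separated shorthand ever gets unwieldy (for instance when an argument itself has to contain a comma or a space), the same step can be submitted from Python with boto3, where Args is a real list and no quoting is needed. The following is a minimal sketch under some assumptions, not part of the original setup: the region is made up, and spark-submit is invoked through command-runner.jar, the usual way to run Spark as an EMR step.

import boto3

# region_name is an assumption; use the cluster's actual region
emr = boto3.client('emr', region_name='us-east-1')

response = emr.add_job_flow_steps(
    JobFlowId='j-12XXXXXXXXX2R',
    Steps=[{
        'Name': 'step0_do_something',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',  # runs spark-submit on the master node
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--conf', 'spark.yarn.appExecutorEnv.PYTHON_EGG_CACHE=/tmp',
                '--conf', 'spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp',
                '--conf', 'spark.executorEnv.PYTHON_EGG_CACHE=/tmp',
                '--conf', 'spark.yarn.submit.waitAppCompletion=true',
                '--conf', 'spark.master=yarn',
                '--py-files', 's3://com.some.bucketname/scripts/my_modules.egg',
                's3://com.some.bucketname/scripts/my_steps/step0_do_something.py',
                # everything after the script path reaches it as sys.argv[1..3]
                'com.other.bucketname',
                'an_input_filename.csv',
                'somefoldername/somesubfoldername',
            ],
        },
    }],
)
print(response['StepIds'])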

So are those strings then stored in sys.argv[0], ..., sys.argv[2]? Yes, they end up in sys.argv within the specified Python script step0_do_something.py; note that sys.argv[0] holds the script path itself, so the three strings arrive as sys.argv[1] through sys.argv[3].
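A minimal sketch of that indexing, using the example names from the question (the assembled S3 path at the end is just an illustration, not part of the original script):

import sys

if __name__ == '__main__':
    # with spark-submit, sys.argv[0] is the path of the script itself
    print(sys.argv[0])            # .../step0_do_something.py
    bucket_name = sys.argv[1]     # 'com.other.bucketname'
    input_filename = sys.argv[2]  # 'an_input_filename.csv'
    folder_path = sys.argv[3]     # 'somefoldername/somesubfoldername'
    # e.g. assemble the full input path from the three arguments:
    print('s3://{}/{}/{}'.format(bucket_name, folder_path, input_filename))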