Python 3.x: How do you submit Beam's wordcount Python example to a remote Spark cluster on EMR running YARN, using the portable runner and spark-submit?


I am trying to submit Beam's wordcount Python example to a remote Spark cluster on EMR that runs YARN as its resource manager. According to the Spark documentation, this requires the use of spark-submit.

Following the portable runner instructions, I have started the job service endpoint, and it appears to start correctly:

$ docker run --net=host apache/beam_spark_job_server:latest --spark-master-url=spark://*.***.***.***:7077
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: ArtifactStagingService started on localhost:8098
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Java ExpansionService started on localhost:8097
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: JobService started on localhost:8099
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Job server now running, terminate with Ctrl+C
Now I try to submit the job using spark-submit, with a plain-text version of Sherlock Holmes as the input:

$ spark-submit --master=yarn --deploy-mode=cluster  wordcount.py --input data/sherlock.txt --output output --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER --environment_config=apachebeam/python3.7_sdk
20/08/31 12:19:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/31 12:19:40 INFO RMProxy: Connecting to ResourceManager at ip-***-**-**-***.ec2.internal/***.**.**.***:8032
20/08/31 12:19:40 INFO Client: Requesting a new application from cluster with 2 NodeManagers
20/08/31 12:19:40 INFO Configuration: resource-types.xml not found
20/08/31 12:19:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/08/31 12:19:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
20/08/31 12:19:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/08/31 12:19:40 INFO Client: Setting up container launch context for our AM
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: /usr/lib/spark/python/lib/pyspark.zip not found; cannot run pyspark application in YARN mode.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.deploy.yarn.Client.$anonfun$findPySparkArchives$2(Client.scala:1167)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:1163)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:858)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:178)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1134)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/08/31 12:19:40 INFO ShutdownHookManager: Shutdown hook called
20/08/31 12:19:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-ee751413-e29d-4b1f-8a16-fb8650b1ca10

It seems to expect a pyspark installation. I'm fairly new to submitting Beam jobs to a Spark cluster. Is there a reason pyspark needs to be installed when submitting a Beam job? I have a feeling my spark-submit command is wrong, but I'm having trouble finding more concrete documentation on how to do what I'm attempting.
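
For comparison, the Beam portability documentation submits the pipeline to the job service with a plain python invocation rather than through spark-submit; the job server then hands the work to the Spark master it was started against. A minimal sketch of that form, reusing the paths from above (the SDK image tag is an assumption):

$ python -m apache_beam.examples.wordcount \
    --input data/sherlock.txt \
    --output output \
    --runner=PortableRunner \
    --job_endpoint=localhost:8099 \
    --environment_type=DOCKER \
    --environment_config=apache/beam_python3.7_sdk:latest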

Possibly related to this thread, and possibly also to this thread.