Apache Spark: How do I upload a file to Amazon EMR?


My code is as follows:

# test2.py

from pyspark import SparkContext, SparkConf, SparkFiles
conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
with open(SparkFiles.get("test_warc.txt")) as f:
  print("opened")
sc.stop()

When I run it locally, it works:

spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py

But after adding it as a step on the EMR cluster:

spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py

I get an error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'

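The path to the file is correct, but for some reason the file is not being uploaded there from S3.

I tried loading it from an executor instead: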

from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
def open_on_executor(_):
    # Runs on an executor: returns 1 if the distributed file can be opened
    opened = 0
    with open(SparkFiles.get("test_warc.txt")) as fh:
        opened += 1
        print("opened")
    return opened
count = sc.parallelize(range(1, 3), 2).map(open_on_executor).reduce(add)
print(count)  # prints 2

sc.stop()

It works fine, without errors.

It looks like the --files argument only uploads files to the executors. How can I upload a file to the master?

Your understanding is correct: the --files argument only uploads files to the executors.

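See this passage in the Spark documentation:

file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.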

You can read more about it in the Spark guide to submitting applications.

Now, coming back to your second question: how to upload to the master.

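EMR has a concept called bootstrap actions. The official documentation describes them as follows:

You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on the cluster after Amazon EMR launches the instance using the Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.

How do I use this in my case?

When creating the cluster, you can specify your script in the BootstrapActions JSON, along with your other custom configuration, something like the following: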

BootstrapActions=[
    {
        'Name': 'Setup Environment for downloading my script',
        'ScriptBootstrapAction': {
            'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
        }
    }
]
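For context, here is a minimal sketch of how that BootstrapActions list might be passed to boto3's `run_job_flow` when launching the cluster. The helper names `build_bootstrap_actions` and `launch_cluster`, the cluster name, release label, instance types, region and IAM role names are all illustrative assumptions rather than values from the original answer; adjust them to your account.

```python
def build_bootstrap_actions(script_s3_path):
    """Build the BootstrapActions list for a single download script.

    script_s3_path is the S3 location of the bootstrap script
    (download-file.sh in the answer above).
    """
    return [
        {
            "Name": "Setup Environment for downloading my script",
            "ScriptBootstrapAction": {"Path": script_s3_path},
        }
    ]


def launch_cluster(script_s3_path, region_name="us-east-1"):
    """Launch an EMR cluster that runs the bootstrap script on its nodes.

    Every concrete value below (name, release, instance types, roles)
    is a placeholder assumption, not something the answer specifies.
    """
    import boto3  # imported lazily so the helper above works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(
        Name="test-cluster",
        ReleaseLabel="emr-6.2.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        BootstrapActions=build_bootstrap_actions(script_s3_path),
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

Calling `launch_cluster("s3://your-bucket-name/path-to-custom-scripts/download-file.sh")` would then return the new cluster's ID, provided AWS credentials and the placeholder roles exist in your account.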
The contents of download-file.sh should look like this:

#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir
Now, in your Python script, you can read the file from workingDir/test_warc.txt (here, /opt/your-path/test_warc.txt).
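As a rough sketch of the driver-side read, the helper below looks for the file in the directory the bootstrap script wrote to and fails with a clear message if it is missing. The function name `resolve_side_file` and the `BOOTSTRAP_DIR` constant are hypothetical; the directory is the placeholder path from download-file.sh.

```python
import os

# Placeholder directory taken from download-file.sh; replace with your own path.
BOOTSTRAP_DIR = "/opt/your-path/"


def resolve_side_file(file_name, search_dirs=(BOOTSTRAP_DIR,)):
    """Return the first existing path for file_name among search_dirs.

    Raises FileNotFoundError listing the directories that were checked,
    which is easier to debug than a bare open() failure on the cluster.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, file_name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        f"{file_name} not found in any of: {list(search_dirs)}"
    )
```

In test2.py the `open(SparkFiles.get("test_warc.txt"))` call could then become `open(resolve_side_file("test_warc.txt"))`. Passing extra directories in `search_dirs` lets the same script run locally and on EMR.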
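There is also an option to run a bootstrap action on the master node only, on the task nodes only, or on both. bootstrap-actions/run-if is the script that can be used for this case. More on this can be found in the Amazon EMR documentation on bootstrap actions.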
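A minimal sketch of a master-only bootstrap action built on run-if follows. It assumes the predefined script is available at s3://elasticmapreduce/bootstrap-actions/run-if for your EMR release, and that it takes a condition such as `instance.isMaster=true` followed by the command to run. The helper name `master_only_bootstrap_action` is hypothetical.

```python
# Assumed location of the predefined run-if script; verify it for your EMR release.
RUN_IF_SCRIPT = "s3://elasticmapreduce/bootstrap-actions/run-if"


def master_only_bootstrap_action(name, command):
    """Build a bootstrap action that runs `command` only on the master node.

    run-if evaluates the condition against the instance metadata and
    executes the command only when the condition holds.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": RUN_IF_SCRIPT,
            "Args": ["instance.isMaster=true", command],
        },
    }


action = master_only_bootstrap_action(
    "Run on master only", "echo running on master node"
)
```

The resulting dictionary can be placed in the BootstrapActions list alongside, or instead of, the unconditional entry.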