Dataproc does not unpack files passed as an archive


.net apache-spark google-cloud-platform


I am trying to submit a .NET for Apache Spark job to Dataproc.

The command line looks like this:

gcloud dataproc jobs submit spark \
         --cluster=<cluster> \
         --region=<region> \
         --class=org.apache.spark.deploy.dotnet.DotnetRunner \
         --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
         --archives=gs://bucket/dotnet-build-output.zip \
         -- find

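As a result, GCP does not unpack the file from the storage location given in --archives. The specified file exists, and the path was copied from the GCP UI. I also tried running a specific assembly file from the archive (it exists there), but the job fails with "File does not exist".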
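I think the problem is that your command runs in the Spark driver, which runs on the master node, because Dataproc runs jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.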
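As a rough illustration of that suggestion, the submission from the question could be wrapped as below with the deploy-mode property added. This is an untested sketch: the function name `submit_cluster_mode` and its three arguments are made up for the example, and the jar and archive names are simply the ones from the question.

```shell
#!/usr/bin/env bash
# Sketch only: the question's submission with cluster deploy mode added.
# submit_cluster_mode and its arguments are illustrative names, not part of gcloud.
submit_cluster_mode() {
  local cluster="$1" region="$2" bucket="$3"
  gcloud dataproc jobs submit spark \
    --cluster="${cluster}" \
    --region="${region}" \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars="gs://${bucket}/microsoft-spark-2.4.x-0.11.0.jar" \
    --archives="gs://${bucket}/dotnet-build-output.zip" \
    --properties=spark.submit.deployMode=cluster \
    -- find
}
```

Calling `submit_cluster_mode <cluster> <region> <bucket>` would submit the job. In cluster mode the driver runs inside a YARN container on a worker, so the `find` output should then show whether the archive was extracted next to it.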
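According to the usage help for the --archives flag, archives are extracted only on the worker nodes. I tested this by submitting a job with --archives=gs://my-bucket/foo.zip, where the zip contains two files, foo.txt and deps.txt, and I could then find the extracted files on a worker node: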
my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt
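The numbered filecache directory (40 in the listing above) varies from job to job, so a small search helper can save some guessing. The following is a minimal sketch; `find_localized_archive` is a hypothetical helper, and the default root is just the path seen in the listing above, which may differ on other clusters.

```shell
#!/usr/bin/env bash
# Sketch: print every directory under the YARN usercache whose name matches
# the archive. find_localized_archive is a hypothetical helper name.
find_localized_archive() {
  local name="$1"
  # Default root taken from the listing above; pass a second argument to override.
  local root="${2:-/hadoop/yarn/nm-local-dir/usercache}"
  find "${root}" -type d -name "${name}" 2>/dev/null
}
```

On a worker node the usercache directories are owned by the yarn user, so `find_localized_archive foo.zip` would likely need to run in a root shell to see anything.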

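As @dagang mentioned, the --archives and --files parameters do not copy the zip file to the driver instance, so that was the wrong direction.
I went with this approach instead: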
gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"

To check this parameter. This is strange, if the archive file is... It looks like the CLI uses different credentials for archive files and jar files.