Running Spark on AWS EMR: how do I run the driver on the master node?

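It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unutilized. Is it possible to run the driver on the MASTER node instead? I have experimented with the --deploy-mode argument, to no avail.

Here is my instance groups JSON definition: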

[
  {
    "InstanceGroupType": "MASTER",
    "InstanceCount": 1,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Master"
  },
  {
    "InstanceGroupType": "CORE",
    "InstanceCount": 3,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Executors"
  }
]

Here is my configurations JSON definition:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    },
    "Configurations": []
  },
  {
    "Classification": "spark-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
        },
        "Configurations": [
        ]
      }
    ]
  }
]

Here is my steps JSON definition:

[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]

I am launching the cluster with aws emr create-cluster and --release-label emr-4.3.0.

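Setting the location of the driver

With spark-submit, the --deploy-mode flag selects where the driver runs.

Submitting applications in client mode is advantageous when you are debugging and want to see the application's output quickly. For applications in production, the best practice is to run them in cluster mode. This mode ensures that the driver is always available during application execution. However, if you do use client mode and submit applications from outside the EMR cluster (for example, locally from a laptop), keep in mind that the driver then runs outside the EMR cluster, so driver-executor communication will have higher latency.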
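As a sketch of how this could apply to the steps definition from the question, the deploy mode can be passed as an ordinary spark-submit argument in Args. This assumes the jar sits on the master node's local filesystem (as the /home/hadoop path in the question suggests) and that the step's Args are forwarded to spark-submit as written; the class and jar names are the question's own placeholders.

```json
[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--deploy-mode", "client",
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]
```

In client mode the spark-submit process started by the step hosts the driver, so the driver should end up on the node where the step runs, and YARN only needs to place a small application master on a core node.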

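I don't think this is a waste. When running Spark on EMR, the master node runs the YARN ResourceManager, Livy Server, and any other applications you selected. If you submit in client mode, the driver program also runs on the master node.

Note that the driver can be heavier than the tasks on the executors, for example when it collects all the results from all the executors. In that case, if the master node is where the driver runs, you need to give it enough resources.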

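As far as I can tell, the answer is no. The master node's only responsibility seems to be running YARN. I thought I might get a slave node to run both the Spark driver and an executor by setting spark.executor.instances higher than the number of nodes, but that did not work.

This is the nature of Spark on YARN. If you set the deploy mode to client, the driver runs on the master node and only a small application master runs on a slave node. Also, if you drop maximizeResourceAllocation and specify what you want for the driver, the executors, and the application master yourself (essentially packing them tightly), you can tune the cluster to your application's needs. You could even try dynamic resource allocation.

Quite wasteful.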
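The suggestion to drop maximizeResourceAllocation and size the driver, executors, and application master by hand could look roughly like the configurations JSON below. This is only a sketch: the memory and core values are illustrative guesses for m3.xlarge nodes, not tested recommendations, and spark.yarn.am.memory only applies in client mode. Enabling dynamic allocation is assumed to rely on the external shuffle service being available on the cluster.

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "4g",
      "spark.yarn.am.memory": "512m",
      "spark.executor.memory": "8g",
      "spark.executor.cores": "4",
      "spark.dynamicAllocation.enabled": "true",
      "spark.shuffle.service.enabled": "true"
    }
  }
]
```

Setting these properties explicitly replaces the "spark" classification with maximizeResourceAllocation from the question; mixing both would let the explicit values compete with the automatically computed ones.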