Resource optimization/utilization in EMR for a long-running job and multiple small running jobs

apache-spark hadoop yarn amazon-emr long-running-processes

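My use case:

We have a long-running job, hereafter called LRJ. This job runs once a week.
We have multiple small jobs that can arrive at any time. These jobs have higher priority than the long-running job.

To address this, we created YARN queues as follows: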

We created YARN queues for resource management: Q1 is configured for the long-running job and Q2 for the small jobs.

Config:
     Q1 : capacity = 50% and it can go up to 100%
          capacity on CORE nodes = 50% and maximum 100%
     Q2 : capacity = 50% and it can go up to 100%
          capacity on CORE nodes = 50% and maximum 100%

The issue we are facing:

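While the LRJ is in progress, it acquires all the resources, so the small jobs have to wait. Once the cluster scales up and new resources become available, the small jobs get resources. However, because the cluster takes time to scale up, this causes a significant delay in allocating resources to those jobs.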

Update 1:

We have tried using the maximum-capacity configuration, but it did not work, as described in another question I posted.

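After further analysis, including discussions with some unsung heroes, we decided to apply preemption to the YARN queues for our use case.
Jobs on the Q1 queue are preempted when the following sequence of events occurs:

1. The Q1 queue is using more than its configured capacity (for example, the LRJ is using more resources than the queue is guaranteed).

2. Jobs are suddenly scheduled on the Q2 queue (for example, several small jobs are triggered at once).
To understand preemption, read [link] and [link].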
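The two conditions above can be sketched as a small check. This is only a simplified model of the idea, not YARN's actual ProportionalCapacityPreemptionPolicy: the `QueueState` class and `preemption_candidates` function are hypothetical names, and the dead zone is assumed to be the guaranteed capacity scaled by `1 + max_ignored_over_capacity`.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    name: str
    guaranteed: float   # configured capacity as a fraction of the cluster (0.4 = 40%)
    used: float         # fraction of the cluster the queue currently holds
    pending: float      # fraction of the cluster requested but not yet allocated
    preemptable: bool   # False when disable_preemption is true for the queue


def preemption_candidates(queues, max_ignored_over_capacity=0.1):
    """Return the names of queues whose containers could be reclaimed.

    Simplified model: preemption is only considered when some queue is
    below its guarantee and still has pending demand.
    """
    starved = any(q.pending > 0 and q.used < q.guaranteed for q in queues)
    if not starved:
        return []

    candidates = []
    for q in queues:
        # Usage inside the dead zone above the guarantee is ignored.
        threshold = q.guaranteed * (1 + max_ignored_over_capacity)
        if q.preemptable and q.used > threshold:
            candidates.append(q.name)
    return candidates


# LRJ holds the whole cluster on Q1 while small jobs are waiting on Q2.
q1 = QueueState("Q1", guaranteed=0.4, used=1.0, pending=0.0, preemptable=True)
q2 = QueueState("Q2", guaranteed=0.6, used=0.0, pending=0.3, preemptable=False)
print(preemption_candidates([q1, q2]))
```

In this model Q2 is never a candidate because its preemption is disabled, which matches the intent that the small jobs always win over the LRJ.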
Below is the sample configuration we use in our AWS CloudFormation script to launch the EMR cluster:
capacity-scheduler configuration:

        yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
        yarn.scheduler.capacity.root.queues: Q1,Q2
        yarn.scheduler.capacity.root.Q2.capacity: 60
        yarn.scheduler.capacity.root.Q1.capacity: 40
        yarn.scheduler.capacity.root.Q2.accessible-node-labels: "*"
        yarn.scheduler.capacity.root.Q1.accessible-node-labels: "*"
        yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity: 100
        yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.capacity: 60
        yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.capacity: 40
        yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.maximum-capacity: 60
        yarn.scheduler.capacity.root.Q2.disable_preemption: true
        yarn.scheduler.capacity.root.Q1.disable_preemption: false

yarn-site configuration:

        yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
        yarn.resourcemanager.scheduler.monitor.enable: true
        yarn.resourcemanager.scheduler.monitor.policies: org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
        yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval: 2000
        yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill: 3000
        yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round: 0.5
        yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity: 0.1
        yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor: 1
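For readers who generate the cluster definition programmatically, the two property groups above can be assembled into EMR's configuration-classification structure. The sketch below is an assumption about how one might do that, not the script from the original answer: `build_configurations` is a hypothetical helper, and the property field is named `Properties` as in the EMR API, whereas (as far as I know) the CloudFormation `AWS::EMR::Cluster` resource calls it `ConfigurationProperties`, hence the parameter.

```python
import json

# Keys and values copied from the configuration above; EMR expects string values.
CAPACITY_SCHEDULER = {
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    "yarn.scheduler.capacity.root.queues": "Q1,Q2",
    "yarn.scheduler.capacity.root.Q2.capacity": "60",
    "yarn.scheduler.capacity.root.Q1.capacity": "40",
    "yarn.scheduler.capacity.root.Q2.accessible-node-labels": "*",
    "yarn.scheduler.capacity.root.Q1.accessible-node-labels": "*",
    "yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity": "100",
    "yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.capacity": "60",
    "yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.capacity": "40",
    "yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.maximum-capacity": "60",
    "yarn.scheduler.capacity.root.Q2.disable_preemption": "true",
    "yarn.scheduler.capacity.root.Q1.disable_preemption": "false",
}

_PREEMPTION = "yarn.resourcemanager.monitor.capacity.preemption."
YARN_SITE = {
    "yarn.resourcemanager.scheduler.class":
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler",
    "yarn.resourcemanager.scheduler.monitor.enable": "true",
    "yarn.resourcemanager.scheduler.monitor.policies":
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        "ProportionalCapacityPreemptionPolicy",
    _PREEMPTION + "monitoring_interval": "2000",
    _PREEMPTION + "max_wait_before_kill": "3000",
    _PREEMPTION + "total_preemption_per_round": "0.5",
    _PREEMPTION + "max_ignored_over_capacity": "0.1",
    _PREEMPTION + "natural_termination_factor": "1",
}


def build_configurations(props_key="Properties"):
    """Return the classification list; props_key is the name of the property field."""
    # Sibling queue capacities under root must add up to 100.
    total = sum(
        int(CAPACITY_SCHEDULER[f"yarn.scheduler.capacity.root.{q}.capacity"])
        for q in CAPACITY_SCHEDULER["yarn.scheduler.capacity.root.queues"].split(",")
    )
    if total != 100:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    return [
        {"Classification": "capacity-scheduler", props_key: dict(CAPACITY_SCHEDULER)},
        {"Classification": "yarn-site", props_key: dict(YARN_SITE)},
    ]


configurations = build_configurations()
print(json.dumps(configurations, indent=2))
```

Keeping the `yarn.scheduler.capacity.*` keys under `capacity-scheduler` and the `yarn.resourcemanager.*` keys under `yarn-site` mirrors the split used in the answer.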
With the above in place, you have to submit each job to a specific queue according to your use case.
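As a minimal sketch of what submitting to a specific queue could look like, the helper below builds a `spark-submit` command line using the standard `--queue` option for YARN. The function name and the script file names (`lrj_job.py`, `small_job.py`) are hypothetical; the queue names are the ones defined above.

```python
import shlex


def spark_submit_command(app, queue, extra_args=()):
    """Build a spark-submit argument list that targets a given YARN queue."""
    return ["spark-submit", "--master", "yarn", "--queue", queue, app, *extra_args]


# The weekly long-running job goes to Q1, the small jobs to Q2.
lrj_cmd = spark_submit_command("lrj_job.py", "Q1")
small_cmd = spark_submit_command("small_job.py", "Q2")

print(shlex.join(lrj_cmd))
print(shlex.join(small_cmd))
```

The same effect can be had by setting the `spark.yarn.queue` property instead of passing `--queue`.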