Apache Spark: getExecutorMemoryStatus().size() does not output the correct number of executors


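In short, I need the number of executors/workers in the Spark cluster, but sc._jsc.sc().getExecutorMemoryStatus().size() gives me 1 when there are in fact 12 executors.

In more detail: I am trying to determine the number of executors and use that number as the number of partitions I ask Spark to distribute my RDD across. I do this to exploit parallelism, since my initial data is just a range of numbers, but each of them is then processed in an rdd#foreach method. The processing is both memory-intensive and computationally heavy, so I want the initial range of numbers to be split into as many partitions as there are executors, so that all executors can process chunks of it simultaneously.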
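To make the goal concrete, here is a minimal pure-Python sketch of cutting a range into contiguous chunks, one per partition. It only illustrates the idea; split_evenly is a hypothetical helper, and this is not a description of how Spark slices data internally.

```python
def split_evenly(n_items, n_parts):
    """Return (start, stop) index pairs that cover range(n_items)
    in n_parts contiguous chunks of nearly equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Chunk boundaries: the i-th boundary sits at floor(i * n_items / n_parts).
    bounds = [(i * n_items) // n_parts for i in range(n_parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# One partition: a single task gets every item.
print(split_evenly(12, 1))
# Twelve partitions: each of 12 executors could take one item.
print(split_evenly(12, 12))
```

With a partition count of 1, which is what the wrong executor count produces, all the work ends up in a single chunk and the other executors have nothing to do.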

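Based on comments I read and on the Scala documentation for getExecutorMemoryStatus, the suggested command, sc._jsc.sc().getExecutorMemoryStatus().size(), seemed reasonable. But for some reason I get an answer of 1 no matter how many executors actually exist (in my last run there were 12).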

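Am I doing something wrong? Am I calling the wrong method? Or calling it in the wrong way?

I am running on a standalone Spark cluster that is started anew each time to run the application.

Here is a minimal example of the problem: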

from pyspark import SparkConf, SparkContext
import datetime


def print_debug(msg):
    dbg_identifier = 'dbg_et '
    print(dbg_identifier + str(datetime.datetime.now()) + ':  ' + msg)


print_debug('*****************before configuring sparkContext')
conf = SparkConf().setAppName("reproducing_bug_not_all_executors_working")
sc = SparkContext(conf=conf)
print_debug('*****************after configuring sparkContext')


def main():
    executors_num = sc._jsc.sc().getExecutorMemoryStatus().size()
    list_rdd = sc.parallelize([1, 2, 3, 4, 5], executors_num)
    # Adjacent string literals are concatenated, so the message stays one string.
    print_debug('line before loop_a_lot. Number of partitions created={0}, '
                'while number of executors is {1}'
                .format(list_rdd.getNumPartitions(), executors_num))
    list_rdd.foreach(loop_a_lot)
    print_debug('line after loop_a_lot')


def loop_a_lot(x):
    y = x
    print_debug('started working on item %d at ' % x + str(datetime.datetime.now()))
    for i in range(100000000):
        y = y*y/6+5
    print_debug('--------------------finished working on item %d at ' % x + str(datetime.datetime.now())
      + 'with a result: %.3f' % y)

if __name__ == "__main__":
    main()

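To illustrate the problem: on my last run, the driver's output contained the following (only the relevant parts are pasted, with placeholders instead of the real IPs and ports):

Can anyone help me understand the cause of the problem? Any ideas? Could it be because of Slurm? (As you can see, I grep-ed the driver's output file. I am running Spark on Slurm because the cluster I have access to is managed by it.)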

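Short fix: allow some time (for example, add a sleep command) before using defaultParallelism or sc._jsc.sc().getExecutorMemoryStatus(), if you use either of them at the beginning of the application's execution.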
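A fixed sleep either waits too long or not long enough. One possible alternative is to poll until the expected number of executors has registered, with a timeout as a safety net. The sketch below is only an illustration: wait_for_executors and its parameters are hypothetical names, and the counting function is passed in as a callable so the helper does not depend on Spark itself.

```python
import time


def wait_for_executors(get_count, expected, timeout_s=60.0, poll_s=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_count() until it reaches `expected` or the timeout expires.

    Returns the last observed count, so the caller can decide what to do
    if the cluster never reached the expected size.
    """
    deadline = clock() + timeout_s
    count = get_count()
    while count < expected and clock() < deadline:
        sleep(poll_s)
        count = get_count()
    return count


# Hypothetical usage with a live SparkContext `sc` (not executed here).
# Subtracting 1 assumes the driver is counted in the status map.
#   registered = wait_for_executors(
#       lambda: sc._jsc.sc().getExecutorMemoryStatus().size() - 1,
#       expected=12)
#   rdd = sc.parallelize(range(100), max(registered, 1))
```

The helper returns the last observed count rather than raising on timeout, so an application can still proceed with fewer partitions if some executors never come up.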

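Explanation: there seems to be a short period at startup during which there is only one executor (I assume that this single executor is the driver, which in some contexts is counted as an executor). That is why calling sc._jsc.sc().getExecutorMemoryStatus() at the top of the main function gave me the wrong number. The same happened with defaultParallelism (1).

My suspicion is that the driver starts working, using itself as a worker, before all the workers have connected to it. This agrees with the fact that submitting the following code to spark-submit with --total-executor-cores 12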

import time

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("app_name")
sc = SparkContext(conf=conf)
# Log through the JVM's log4j so the messages land in the driver log.
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger("dbg_et")

log.warn('defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))
time.sleep(15)  # give the executors time to register with the driver
log.warn('After 15 seconds: defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))
# NOTE: the original called spark_context_holder.getParallelismAlternative() * 3
# here. That helper is not shown, so sc.defaultParallelism stands in for it
# (an assumption, consistent with the 36 partitions in the output).
rdd_collected = (sc.parallelize([1, 2, 3, 4, 5] * 200,
                                sc.defaultParallelism * 3)
                 .map(lambda x: (x, x * x) * 2)
                 .map(lambda x: x[2] + x[1])
                 )
log.warn('Made rdd with {0} partitioned. About to collect.'
         .format(rdd_collected.getNumPartitions()))
rdd_collected.collect()
log.warn('And after rdd operations: defaultParallelism={0}, and size of executorMemoryStatus={1}'
         .format(sc.defaultParallelism,
                 sc._jsc.sc().getExecutorMemoryStatus().size()))

gave me the following output:
> tail -n 4 slurm-<job number>.out
18/09/26 13:23:52 WARN dbg_et: defaultParallelism=2, and size of executorMemoryStatus=1
18/09/26 13:24:07 WARN dbg_et: After 15 seconds: defaultParallelism=12, and size of executorMemoryStatus=13
18/09/26 13:24:07 WARN dbg_et: Made rdd with 36 partitioned. About to collect.
18/09/26 13:24:11 WARN dbg_et: And after rdd operations: defaultParallelism=12, and size of executorMemoryStatus=13
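The size of 13 reported after the sleep, against 12 requested executor cores, is consistent with the status map holding one entry for the driver in addition to the executors. Assuming that is the case, the executor count can be derived by subtracting one. The function name below is hypothetical.

```python
def executors_excluding_driver(memory_status_size):
    """Number of executors, assuming one map entry belongs to the driver.

    Clamped at zero so an empty or driver-only map never yields a
    negative count.
    """
    return max(memory_status_size - 1, 0)


# The two sizes seen in the log above: at startup and after 15 seconds.
for size in (1, 13):
    print(size, '->', executors_excluding_driver(size))
```

For the two logged sizes this prints `1 -> 0` and `13 -> 12`, which matches the picture of no executors registered at startup and all 12 registered later.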

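(1) Before I started using getExecutorMemoryStatus() I tried using defaultParallelism, which is what you should use, but it kept giving me the number 2. Now I understand that this was for the same reason. When running on a standalone cluster, if the driver sees only 1 executor, then defaultParallelism = 2, as can be seen in the documentation for spark.default.parallelism.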
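The Spark configuration documentation describes the default for spark.default.parallelism, outside local mode and Mesos fine-grained mode, as the total number of cores on all executor nodes or 2, whichever is larger. The sketch below simply restates that rule, assuming the property has not been set explicitly. The function name is hypothetical.

```python
def documented_default_parallelism(total_executor_cores):
    """Documented default outside local and Mesos fine-grained modes:
    total executor cores or 2, whichever is larger."""
    return max(total_executor_cores, 2)


# Before any executor has registered there are no executor cores yet,
# so the floor of 2 applies; with 12 cores registered the result is 12.
print(documented_default_parallelism(0))
print(documented_default_parallelism(12))
```

This matches the log: defaultParallelism=2 at startup and defaultParallelism=12 once the executors are connected.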

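(2) I am not sure how the values could already be correct before the directories were created, but I assume that the executors' startup order has them connect to the driver before creating their directories. The directory timestamps in question are listed below.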

 > ls -l --time-style=full-iso spark/worker_dir/app-20180926132351-0000/
 <permission user blah> 2018-09-26 13:24:08.909960000 +0300 0/
 <permission user blah> 2018-09-26 13:24:08.665098000 +0300 1/
 <permission user blah> 2018-09-26 13:24:08.912871000 +0300 10/
 <permission user blah> 2018-09-26 13:24:08.769355000 +0300 11/
 <permission user blah> 2018-09-26 13:24:08.931957000 +0300 2/
 <permission user blah> 2018-09-26 13:24:09.019684000 +0300 3/
 <permission user blah> 2018-09-26 13:24:09.138645000 +0300 4/
 <permission user blah> 2018-09-26 13:24:08.757164000 +0300 5/
 <permission user blah> 2018-09-26 13:24:08.996918000 +0300 6/
 <permission user blah> 2018-09-26 13:24:08.640369000 +0300 7/
 <permission user blah> 2018-09-26 13:24:08.846769000 +0300 8/
 <permission user blah> 2018-09-26 13:24:09.152162000 +0300 9/