Apache Spark: Initial job has not accepted any resources

I have set up a Spark cluster across 4 different machines. Each machine has 7.7 GB of RAM and an 8-core i7 processor. I am using PySpark and trying to load 5 numpy arrays (2.9 GB each) into the cluster; they are all slices of a larger 14 GB numpy array that I generated on another machine. I tried running a simple count on the first RDD to make sure the cluster is working properly, and I get the following warning while it executes:

>>> import numpy as np
>>> gen1 = sc.parallelize(np.load('/home/hduser/gen1.npy'),512)
>>> gen1.count()
[Stage 0:>                                                        (0 + 0) / 512]
17/01/28 13:07:07 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:22 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:37 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:>                                                        (0 + 0) / 512]
17/01/28 13:07:52 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
^C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 1008, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/spark/python/pyspark/rdd.py", line 999, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/opt/spark/python/pyspark/rdd.py", line 873, in fold
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 931, in __call__
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 695, in send_command
  File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 828, in send_command
  File "/home/hduser/anaconda2/lib/python2.7/socket.py", line 451, in readline
    data = self._sock.recv(self._rbufsize)
  File "/opt/spark/python/pyspark/context.py", line 223, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
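The warning itself points at the cluster UI: the job sits at (0 + 0) / 512 until the master can hand out executors, so the first thing to confirm is that the workers are registered and show free cores and memory. A minimal sketch of pulling that information from the standalone master's web UI, assuming the master host used in the configuration below and the UI's JSON view on port 8080 (the exact field names can vary between Spark versions):

import json
import urllib2  # Python 2, matching the interpreter in the traceback above

# Master address taken from the configuration later in this post; adjust as needed.
master_state = json.load(urllib2.urlopen("http://192.168.1.2:8080/json"))

for w in master_state.get("workers", []):
    print("%s  state=%s  cores used %s/%s  memory used %s/%s MB" % (
        w.get("host"), w.get("state"),
        w.get("coresused"), w.get("cores"),
        w.get("memoryused"), w.get("memory")))

If no workers show up here, registration is the problem; if they show up but have no free cores or memory, the job's resource requests are the problem.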
My settings in spark-env.sh on the master:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=192.168.1.2

My settings in spark-defaults.conf on the master:

spark.master    spark://lebron:7077
spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
spark.kryoserializer.buffer.max 128m

These settings are identical on every worker machine, except that each worker only has the spark.master and spark.serializer configuration options set, as above.

I also still need to figure out how to optimize my memory management: before this issue came up I should have had plenty of memory, yet Java heap space exceptions were being thrown left and right. Perhaps I will leave that for another question, though.

Please help.
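One thing worth double-checking with this symptom is whether the application is asking for more memory or cores than any single worker advertises (each machine here has 7.7 GB and 8 cores, and the driver alone is configured to take 5 GB). A minimal sketch, assuming standalone mode, of setting the executor requests explicitly when building the context; the 2g and 8-core values are placeholders, not recommendations:

from pyspark import SparkConf, SparkContext

# Placeholder values: each request must fit inside what a worker advertises
# (7.7 GB RAM / 8 cores per machine in this cluster).
conf = (SparkConf()
        .setMaster("spark://lebron:7077")
        .setAppName("resource-check")
        .set("spark.executor.memory", "2g")   # heap per executor
        .set("spark.cores.max", "8"))         # total cores the app may claim

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100), 8).count())  # tiny smoke test instead of the 2.9 GB array

If this small job runs but the real one still stalls, the resource requests are probably not the issue and the firewall explanation below becomes the more likely culprit.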

If you can see the Spark slaves in the web UI but they are not accepting jobs, there is a good chance that a firewall is blocking the communication.

You can run a test like the one in my other answer.
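As an illustration of the firewall point (this is not the linked test), a bare TCP reachability probe run from each worker toward the master, using the address and ports from the configuration above, might look like the following. Note that executors also have to connect back to the driver on ephemeral ports, which a fixed-port probe like this cannot cover:

import socket

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

# Master address and ports taken from this post's configuration; adjust for your cluster.
for host, port in [("192.168.1.2", 7077),   # standalone master RPC port
                   ("192.168.1.2", 8080)]:  # master web UI
    print("%s:%d -> %s" % (host, port, "reachable" if can_connect(host, port) else "blocked?"))

If the master ports are reachable but tasks still never start, opening the firewall for the worker and driver port ranges (or pinning them with settings such as spark.driver.port) is a common next step.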
