
How do I run a Python script in a Spark job?


I installed Spark on 3 machines from the tarball. I did no advanced configuration: I edited the slaves file and started the master and the workers. I can see the Spark UI on port 8080. Now I want to run a simple Python script on the Spark cluster:

import sys
from random import random
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
    print "Pi is roughly %f" % (4.0 * count / n)

    sc.stop()
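As an aside, the estimator in that script is plain Monte Carlo and can be sanity-checked without a cluster. A minimal Python 3 sketch of the same logic (the question's code is Python 2, using xrange and the print statement; here the sampling is seeded so the run is repeatable):

```python
import random

# Seed for repeatability; the Spark version uses unseeded random().
random.seed(0)

def inside_unit_circle(_):
    # Sample a point uniformly in the square [-1, 1] x [-1, 1]
    # and test whether it falls inside the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0

n = 100_000
count = sum(map(inside_unit_circle, range(n)))
pi_estimate = 4.0 * count / n
print("Pi is roughly %f" % pi_estimate)
```

The fraction of points landing inside the circle approximates pi/4, which is why the result is multiplied by 4.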
I am running this command:

spark-submit --master spark://IP:7077 pi.py 1

but I get the error below:

14/12/22 18:31:23 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/12/22 18:31:38 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:31:43 INFO client.AppClient$ClientActor: Connecting to master spark://10.77.36.243:7077...
14/12/22 18:31:53 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:32:03 INFO client.AppClient$ClientActor: Connecting to master spark://10.77.36.243:7077...
14/12/22 18:32:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:32:23 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/12/22 18:32:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/12/22 18:32:23 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/12/22 18:32:23 INFO scheduler.DAGScheduler: Failed to run reduce at /opt/pi.py:21
Traceback (most recent call last):
  File "/opt/pi.py", line 21, in <module>
    count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
Has anyone faced the same problem? Please help.

This:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
means the cluster does not have any resources available.

Check the status of your cluster in the web UI, and verify the cores and RAM your workers report.
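If the workers are registered but the job asks for more than they offer, one option is to cap the job's resource request. The property names below are standard Spark standalone-mode settings, but the values are illustrative placeholders to adjust to your workers; they can be passed via SparkConf().setAll(...) or as spark-submit --conf flags:

```python
# Resource caps for a Spark standalone cluster; values are
# illustrative placeholders, not verified for this cluster.
spark_settings = [
    ("spark.master", "spark://10.77.36.243:7077"),  # must match the URL on the master's 8080 UI
    ("spark.app.name", "PythonPi"),
    ("spark.executor.memory", "512m"),  # ask for less memory than each worker offers
    ("spark.cores.max", "2"),           # cap the total cores the application may take
]

for key, value in spark_settings:
    print("%s=%s" % (key, value))
```

With these caps in place, the scheduler can actually grant the request instead of leaving the job waiting for resources that never free up.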

Also, double-check your IP address: the master URL you pass to spark-submit must match the one shown on the master's web UI.
