Apache Spark PySpark memory issue
I am running a program that invokes Spark parallelization many times. The program runs fine for the first few iterations, but then crashes due to a memory issue. I am using Spark 2.2.0 with Python 2.7, running the test on AWS EC2 with 30 GB of memory. Below is my Spark configuration:
conf = pyspark.SparkConf()
conf.set("spark.executor.memory", '4g')
conf.set('spark.executor.cores', '16')
conf.set('spark.cores.max', '16')
conf.set("spark.driver.memory",'4g')
conf.setMaster("local[*]")
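The `collect()` call shown in the traceback below deserializes every mapped result on the driver at once, which is a common way to hit a `MemoryError` as input grows across iterations. As a point of comparison, here is a minimal pure-Python sketch of processing the input in smaller batches instead; `flex_func` and `input_json` are hypothetical placeholders for the names in my program, and the Spark-specific call is shown only as a comment:

```python
def batches(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def flex_func(record):          # placeholder for the real per-record work
    return record * 2

input_json = list(range(10))    # placeholder input

results = []
for batch in batches(input_json, batch_size=4):
    # In the real program this loop body would be something like:
    #   results.extend(sc.parallelize(batch).map(flex_func).collect())
    # so only one batch of results is held on the driver at a time.
    results.extend(map(flex_func, batch))
```

This only bounds how much lands on the driver per `collect()`; it does not change per-task memory use on the workers.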
Here is my error log:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda2\lib\site-packages\flask\app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\ProgramData\Anaconda2\lib\site-packages\flask\app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\ProgramData\Anaconda2\lib\site-packages\flask\app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "C:\ProgramData\Anaconda2\lib\site-packages\flask\app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\ProgramData\Anaconda2\lib\site-packages\flask\app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:/Users/Administrator/Desktop/Flex_Api_Post/flex_api_post_func_spark_setup.py", line 152, in travel_time_est
    count = ssc.parallelize(input_json).map(lambda j: flex_func(j)).collect()
  File "C:\ProgramData\Anaconda2\lib\site-packages\pyspark\rdd.py", line 809, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\ProgramData\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\ProgramData\Anaconda2\lib\site-packages\py4j\protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 13.0 failed 1 times, most recent failure: Lost task 7.0 in stage 13.0 (TID 215, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 166, in main
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 57, in read_command
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 454, in loads
    return pickle.loads(obj)
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 166, in main
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 57, in read_command
  File "C:\opt\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 454, in loads
    return pickle.loads(obj)
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more