Setting application memory size from the PySpark shell

Tags: python, apache-spark

I am trying to sort 25 million integers, but when I call collect() I get an OutOfMemoryError: Java heap space. Here is the source code:

from pyspark import SparkContext

sc = SparkContext("local", "pyspark")
numbers = sc.textFile("path of text file")
# Tokenize each line, parse the integers, pair each with a dummy count, and sort by key.
counts = numbers.flatMap(lambda x: x.split()).map(lambda x: (int(x), 1)).sortByKey()
num_list = []
# collect() pulls every (num, count) pair back to the driver.
for (num, count) in counts.collect():
    num_list.append(num)
Where am I going wrong? The text file is 147 MB. All settings are at their defaults, and I am using Spark v0.9.0.
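For reference, a minimal sketch of raising the memory limit on Spark 0.9.x, based on the SparkConf API from the 0.9.0 docs. The 2g figure is an arbitrary example, and in local mode it may be the shell's own JVM heap (set through the SPARK_MEM environment variable in 0.9.x) that matters rather than executor memory:

from pyspark import SparkConf, SparkContext

# 2g is an arbitrary example value, not taken from the original post.
conf = (SparkConf()
        .setMaster("local")
        .setAppName("pyspark")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)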

Edit: It works with a file of 2.5 million integers, but the problem starts at 5 million. I also tested with 10 million and got the same OOM error.
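Since collect() pulls the entire sorted result into the driver's heap, one way around the error is to keep the data distributed and only bring back what is actually needed. A sketch reusing the counts RDD from above (the output path is a placeholder):

sorted_nums = counts.map(lambda kv: kv[0])        # drop the dummy counts
sorted_nums.saveAsTextFile("path of output dir")  # write the sorted output without collecting it
first_ten = sorted_nums.take(10)                  # pull back only a handful of elements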

Here is the stack trace:

14/02/06 22:44:31 ERROR Executor: Exception in task ID 5
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2798)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:48)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
14/02/06 22:44:31 WARN TaskSetManager: Lost TID 5 (task 0.0:0)
14/02/06 22:44:31 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2798)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:48)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
14/02/06 22:44:31 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
14/02/06 22:44:31 INFO TaskSchedulerImpl: Remove TaskSet 0.0 from pool 
14/02/06 22:44:31 INFO DAGScheduler: Failed to run collect at <ipython-input-7-cf9439751c70>:1
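Reading the trace: the executor runs out of heap inside JavaSerializerInstance.serialize while packaging a task's result for collect(), i.e. an entire partition's worth of pairs is serialized in memory at once. Splitting the input into more, smaller partitions is one sketch that might shrink each serialized chunk (32 is an arbitrary example; in 0.9.x the parameter was named minSplits):

# 32 is an arbitrary partition count; more partitions mean smaller per-task results.
numbers = sc.textFile("path of text file", 32)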