
Python: Out-of-memory error when reading 400,000 rows in Spark SQL


I have some data in Postgres and am trying to read it into a Spark dataframe, but I get the error java.lang.OutOfMemoryError: GC overhead limit exceeded. I am using PySpark on a machine with 8 GB of memory.

Here is the code:

import findspark
findspark.init()

from pyspark import SparkContext, SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

# Read the whole Postgres table into a dataframe over JDBC
temp_df = sql_context.read.format('jdbc').options(
    url="jdbc:postgresql://localhost:5432/database",
    dbtable="table_name",
    user="user",
    password="password",
    driver="org.postgresql.Driver").load()
I am very new to the Spark world. I tried the same thing with Python pandas and it worked without any problem, but with Spark I get this error:

Exception in thread "refresh progress" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.VectorBuilder.<init>(Vector.scala:713)
at scala.collection.immutable.Vector$.newBuilder(Vector.scala:22)
at scala.collection.immutable.IndexedSeq$.newBuilder(IndexedSeq.scala:46)
at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70)
at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104)
at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.ui.ConsoleProgressBar$$anonfun$3.apply(ConsoleProgressBar.scala:89)
at org.apache.spark.ui.ConsoleProgressBar$$anonfun$3.apply(ConsoleProgressBar.scala:82)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:82)
at org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71)
at org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:56)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Exception in thread "RemoteBlock-temp-file-clean-thread" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.storage.BlockManager$RemoteBlockDownloadFileManager.org$apache$spark$storage$BlockManager$RemoteBlockDownloadFileManager$$keepCleaning(BlockManager.scala:1648)
    at org.apache.spark.storage.BlockManager$RemoteBlockDownloadFileManager$$anon$1.run(BlockManager.scala:1615)
2018-11-12 21:48:16 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
2018-11-12 21:48:16 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-11-12 21:48:16 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-11-12 21:48:16 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded

2018-11-12 21:48:16 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
My end goal is to use Spark to do some processing on large database tables. Any help would be appreciated.

Sorry, it looks like you simply don't have enough memory. Keep in mind that Spark is designed for distributed processing of large data sets on a cluster, so it may not be the best choice for what you are doing here. That said, the read itself can often be made much cheaper; see the sketch after this answer.

Regards
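For what it's worth, a JDBC read like this often runs out of memory not because 400,000 rows are too many for the machine, but because Spark pulls the whole table through a single partition by default. Below is a minimal sketch of a partitioned read; the column name id and the bounds are assumptions, so substitute an indexed numeric column from your table and its actual min/max:

import findspark
findspark.init()

from pyspark import SparkContext, SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

# Spark issues one query per partition instead of pulling the whole
# table through a single connection. "id", the bounds, and the partition
# count are placeholders for values from your actual table.
temp_df = sql_context.read.format('jdbc').options(
    url="jdbc:postgresql://localhost:5432/database",
    dbtable="table_name",
    user="user",
    password="password",
    driver="org.postgresql.Driver",
    partitionColumn="id",
    lowerBound="1",
    upperBound="400000",
    numPartitions="8",
    fetchsize="10000").load()  # rows fetched per JDBC round trip

The fetchsize option matters for Postgres in particular: without it, the PostgreSQL JDBC driver tends to buffer large result sets in memory, which is a common cause of exactly this GC overhead error.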

Edit: as suggested by @LiJianing, I tried:
from pyspark import SparkConf, SparkContext

# Give the executor 8 GB before the context is created
conf = SparkConf().set("spark.executor.memory", "8g")
sc = SparkContext(conf=conf)
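One caveat, assuming this runs in local mode (the default when SparkContext() is created without a master): the executor lives inside the driver JVM there, so spark.executor.memory has little effect, and the Spark docs note that spark.driver.memory cannot be set through SparkConf in client mode because the driver JVM has already started. A sketch of the usual workaround, setting the driver memory before the JVM is launched (the 8g value is just an example):

import os
# Must run before SparkContext() starts the JVM; the trailing
# "pyspark-shell" token is required by PySpark's launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext()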