Apache Spark: very long task deserialization time


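I am new to Spark and seem to be running into performance problems. I am trying to compute a simple correlation between different parameters in a DataFrame (using PySpark on Spark 1.5.2), but the task deserialization time is very long compared with the actual computation time.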

Here is a screenshot taken while computing the correlation for two different pairs of parameters:

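To compute the correlation I simply call full_dataframe.stat.corr('param1', 'param2'). I cache the dataset first and then run this computation. What I actually want is the correlation between all parameters, so that I can build a correlation map, so I call this line in a loop that iterates over the different parameter combinations. The cached dataset is 5.2 GB.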
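The loop described above can be sketched as follows. This is a minimal illustration, not the code from the question: `correlation_map`, `pearson`, and the toy `data` dictionary are hypothetical names introduced here, and a plain-Python Pearson function stands in for the Spark call so the sketch runs without a cluster.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain-Python Pearson correlation, used only as a local stand-in for Spark."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_map(columns, corr_fn):
    """Call corr_fn once per unordered pair of columns and collect the results."""
    return {(a, b): corr_fn(a, b) for a, b in combinations(columns, 2)}


# Toy in-memory data standing in for the cached DataFrame (hypothetical values).
data = {
    "param1": [1.0, 2.0, 3.0, 4.0],
    "param2": [2.0, 4.0, 6.0, 8.0],
    "param3": [4.0, 3.0, 2.0, 1.0],
}

corr_map = correlation_map(sorted(data), lambda a, b: pearson(data[a], data[b]))
for pair, value in sorted(corr_map.items()):
    print(pair, round(value, 3))
```

With PySpark, `corr_fn` would presumably be something like `lambda a, b: full_dataframe.stat.corr(a, b)`. Note that n columns lead to n(n-1)/2 separate calls, each of which is submitted to the cluster on its own, so any fixed per-task overhead is paid once per pair.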

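I am running this job on a 4-node cluster (YARN), where each machine has:

  • 10 GB of RAM (8 GB reserved for YARN)
  • 8 cores (16 virtual cores, 14 reserved for YARN)

I use PySpark through Jupyter, which I start with: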

pyspark --master yarn --driver-memory 2560m --num-executors 4 --executor-cores 4 --executor-memory 5G --conf spark.yarn.executor.memoryOverhead=2048
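As a rough sanity check of these settings against the node description, the arithmetic can be sketched like this. It is a simplification under stated assumptions: it ignores YARN's rounding of container sizes to its allocation increment and the container needed by the application master, and all variable names are made up for illustration.

```python
# Figures taken from the launch command and the node description in the question.
executor_memory_mb = 5 * 1024        # --executor-memory 5G
memory_overhead_mb = 2048            # executor memoryOverhead passed via --conf
num_executors = 4                    # --num-executors 4
executor_cores = 4                   # --executor-cores 4
yarn_memory_per_node_mb = 8 * 1024   # 8 GB per node reserved for YARN
yarn_vcores_per_node = 14            # 14 virtual cores per node reserved for YARN
num_nodes = 4

# Each executor container needs heap plus off-heap overhead.
container_mb = executor_memory_mb + memory_overhead_mb

# How many such containers fit on one node, limited by memory and by vcores.
fit_by_memory = yarn_memory_per_node_mb // container_mb
fit_by_cores = yarn_vcores_per_node // executor_cores
executors_per_node = min(fit_by_memory, fit_by_cores)

cluster_capacity = executors_per_node * num_nodes
total_task_slots = num_executors * executor_cores

print("container size (MB):", container_mb)
print("executors per node:", executors_per_node)
print("cluster capacity (executors):", cluster_capacity)
print("concurrent task slots:", total_task_slots)
```

Under these assumptions memory is the limiting factor: one 7168 MB container fits per node, so the four requested executors occupy all four nodes and provide 16 concurrent task slots. The 2560 MB driver heap is separate from these containers.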

I have tried df.repartition(num_partitions) with different numbers of partitions, e.g. 16, 32, 128 and 256, but none of them helped.
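For orientation, the partition counts that were tried translate into the following approximate partition sizes and task waves. This is a back-of-the-envelope sketch that assumes the 5.2 GB of cached data is spread evenly across partitions and that all 16 task slots (4 executors x 4 cores) are available; `partition_profile` is a hypothetical helper.

```python
from math import ceil

cached_size_mb = 5.2 * 1024   # cached dataset size reported in the question
task_slots = 4 * 4            # 4 executors x 4 cores each


def partition_profile(num_partitions):
    """Approximate size of one partition and number of task waves per stage."""
    return {
        "mb_per_partition": round(cached_size_mb / num_partitions, 1),
        "task_waves": ceil(num_partitions / task_slots),
    }


profiles = {n: partition_profile(n) for n in (16, 32, 128, 256)}
for n, profile in profiles.items():
    print(n, profile)
```

Fewer partitions mean larger tasks that finish in a single wave, while more partitions mean many small tasks, each of which pays the per-task deserialization cost again.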

In addition, after a while my job breaks down completely and I get the following error from the UI:

HTTP ERROR 500

Problem accessing /proxy/application_1485432889177_0016/stages/stage. Reason:

    Connection to http://192.168.84.27:4040 refused
Caused by:

org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.84.27:4040 refused

When I look at the Jupyter output, I see the following exceptions:

17/01/29 17:06:06 ERROR Utils: Uncaught exception in thread task-result-getter-14
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
Exception in thread "task-result-getter-13" 17/01/29 17:06:06 ERROR Utils: Uncaught exception in thread task-result-getter-15
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
Exception in thread "task-result-getter-12" Exception in thread "task-result-getter-14" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
    17/01/29 21:37:43 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
        at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
        at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
        at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
        at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
        at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:46 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
        at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
        at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
        at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
        at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
        at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:50 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-17] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
        at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
        at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
        at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
        at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
        at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
        at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
        at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
        at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
        at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
        at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:54 ERROR ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
HTTP ERROR 500

Problem accessing /jobs/. Reason:

    Server Error
Caused by:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)