
Apache Spark: PySpark PCA with a large number of features

I am trying to run the sample PySpark PCA code.
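For reference, here is a minimal sketch of the standard pyspark.ml PCA usage (the exact script is not shown in the question, so the toy data, column names, and k below are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-example").getOrCreate()

# Toy data: 3 rows with 3 features each; the real case is 5,000,000 x 23,000
data = [(Vectors.dense([0.0, 1.0, 0.5]),),
        (Vectors.dense([2.0, 0.0, 3.0]),),
        (Vectors.dense([4.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)  # at scale, this .fit call is where the error below occurs
model.transform(df).show(truncate=False)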

The DataFrame I load contains 5,000,000 records and 23,000 features. After running the PCA code, I get the following error:

Py4JJavaError: An error occurred while calling o908.fit.
: java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:793)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1137)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
    at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
    at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
    at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
The Spark version is 2.2. I run Spark on YARN, and the Spark parameters are:

spark.executor.memory=32G
spark.driver.memory=32G
spark.driver.maxResultSize=32G

Should I reduce the number of features in order to run PCA, or is there another solution?

I suspect you could run this with a different configuration. How many executors do you have? If you have 100 executors, each allocated 32GB, on a system with 1TB of total memory, you will quickly run out as the executors together try to claim 3.2TB that does not exist. On the other hand, if only one executor is running, 32GB may not be enough for the task. You may find that running 20 executors with 8GB each gives you enough memory to run the job (though it may be slower).
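Expressed in the same property style as the question (a sketch, assuming dynamic allocation is off; spark.executor.instances is the standard property for requesting a fixed executor count on YARN):

spark.executor.instances=20
spark.executor.memory=8G
spark.driver.memory=8G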

When I run into DataFrame problems during an ML workflow, I usually troubleshoot in the following order (a code sketch of the first two steps follows this list):

1) Test the method on a very small DataFrame: 10 features and 1000 rows. To help avoid lineage problems, reduce the sample frame at the source, either with a "limit" clause in SQL or by passing in a smaller CSV. If the method does not work even there, the memory issue is probably secondary.

2) If the method does not work on a very small DataFrame, start investigating the data itself. Are all of your features numeric? Do any of them contain nulls? Non-numeric or null values in the features can break the PCA routine (though not necessarily with an OutOfMemoryError).

3) If both the data and the code are well-formed, start scaling up, and make sure to watch stderr and stdout on the nodes as you go. You should have a utility for reaching your nodes (for example, the Cloudera distribution of Hadoop includes Cloudera Manager, which lets you drill down from jobs to stages to individual tasks to find stderr).
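A rough sketch of steps 1 and 2, assuming df is the source DataFrame before the features are assembled into a vector (column names are illustrative):

from pyspark.sql import functions as F

# Step 1: shrink the frame at the source so the lineage stays small
small_df = df.limit(1000)  # or read in a smaller CSV instead

# Step 2a: check that every feature column is numeric
print(small_df.dtypes)

# Step 2b: count nulls per column
small_df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in small_df.columns]
).show()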

How many executors did you request, and how much memory is available on the system?