
Python: saving/overwriting a machine learning model as a single file on a Spark cluster

Tags: python, apache-spark, machine-learning, pyspark

I have a machine learning model that uses linear regression, and a Spark cluster of 5 virtual machines. After training the model, I want to save it so that I can later load it into memory and use it.

I tried using

model.save("/tmp/model.pkl")
When saved this way, it creates a directory named model.pkl on every node of the cluster, containing the files data/, metadata/, _SUCCESS, ._SUCCESS.crc, _temporary, and a few others.

Is there a way to save the model as a single file, such as model.pkl?

Also, when I retrain the model with newly available data, I am using

model.write().overwrite().save("/tmp/model.pkl")

to overwrite the existing model, so that the newly updated model is persisted to the file system.

But I get a FileAlreadyExistsException:

An error occurred while calling o94.save.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/cat_model.pkl/metadata already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1070)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1035)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1035)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1035)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:961)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:961)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:961)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:960)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1489)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1468)
    at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:278)
    at org.apache.spark.ml.regression.LinearRegressionModel$LinearRegressionModelWriter.saveImpl(LinearRegression.scala:540)
    at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
How can I overwrite the existing model?

I have write permission to the /tmp directory on all nodes of the cluster.

When I try to load the model with model.load('/tmp/model.pkl'), I get the error:

An error occurred while calling o94.load.
: java.lang.UnsupportedOperationException: empty collection
It seems that save(path) is not saving the model correctly. How do I load the saved model properly? What is the correct way in Spark to save a trained model and load it again?

TL;DR: use a distributed file system when working with a cluster.

Is there a way to save the model as a single file, such as model.pkl?

Not really. The different files in the output correspond to different components of the model.
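To see what those components are, you can inspect the saved directory yourself. A minimal sketch, assuming the model was saved to /tmp/model.pkl as in the question and that spark is an active SparkSession: the metadata/ part is plain JSON text (written via saveAsTextFile, as the stack trace above shows), while the data/ part stores the fitted parameters as Parquet.

    # Inspect the metadata/ component: JSON with the class name,
    # Spark version, params, and a timestamp.
    spark.read.json("/tmp/model.pkl/metadata").show(truncate=False)

    # Inspect the data/ component: Parquet holding the fitted parameters
    # (for a LinearRegressionModel, the intercept and coefficients).
    spark.read.parquet("/tmp/model.pkl/data").show(truncate=False)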

Also, when I retrain the model with newly available data, I am using model.write().overwrite().save("/tmp/model.pkl") to overwrite the existing model so the newly updated model is persisted to the file system (…), and then I get a FileAlreadyExistsException

In general, you should not write to the local file system when running on a cluster. The write may partially succeed (note that the _temporary directory is not removed correctly, as it would be on a distributed file system), but the data cannot be loaded afterwards, because the executors will see an inconsistent state of the file system.
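As a concrete illustration of that advice, here is a minimal sketch of the save/load cycle against a shared file system. The HDFS address hdfs://namenode:8020 and the DataFrame training_df are hypothetical placeholders; the point is that the path must be reachable from every node, and that overwrite() is what lets retraining replace the old model without a FileAlreadyExistsException.

    from pyspark.ml.regression import LinearRegression, LinearRegressionModel

    # Train as in the question; training_df is assumed to have
    # 'features' and 'label' columns prepared elsewhere.
    lr = LinearRegression(featuresCol="features", labelCol="label")
    model = lr.fit(training_df)

    # Save to a path visible to all nodes (HDFS here; S3 or NFS also work).
    # overwrite() removes any previous version before writing.
    model_path = "hdfs://namenode:8020/models/lr_model"  # hypothetical address
    model.write().overwrite().save(model_path)

    # Load with the model class itself, not an instance method.
    reloaded = LinearRegressionModel.load(model_path)
    print(reloaded.coefficients, reloaded.intercept)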