Why does an sklearn RandomForest model take up a lot of disk space after saving?

python scikit-learn random-forest

I am saving a RandomForestClassifier model from the sklearn library with the code below:

import cPickle  # Python 2 module; on Python 3 use the built-in pickle module
with open('/tmp/rf.model', 'wb') as f:
    cPickle.dump(RF_model, f)

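The saved file takes up a lot of space on my hard drive. The model has only 50 trees, yet it occupies over 50 MB on disk (the analyzed dataset is about 20 MB, with 21 features). Does anybody know why? I observe similar behavior with ExtraTreesClassifier.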
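One factor worth ruling out is the pickle protocol. On Python 2, `cPickle.dump` defaults to protocol 0, a text format that writes every number as ASCII and is considerably larger than the binary protocols. The sketch below uses only the standard library; `values` is a hypothetical stand-in for the model's numeric data, not the actual forest.

```python
import pickle
import random

# Hypothetical stand-in for a model's numeric data: 100,000 floats.
random.seed(0)
values = [random.random() for _ in range(100_000)]

# Protocol 0 is the ASCII format (the Python 2 default).
text_bytes = pickle.dumps(values, protocol=0)
# The highest available protocol stores each float in a compact binary form.
binary_bytes = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

print("protocol 0:      ", len(text_bytes), "bytes")
print("highest protocol:", len(binary_bytes), "bytes")
```

For the code in the question, the analogous change would be `cPickle.dump(RF_model, f, protocol=cPickle.HIGHEST_PROTOCOL)`. How much that saves for this particular model has not been measured here.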

Edit: the RF parameters:

"n_estimators": 50,
"max_features": 0.2,
"min_samples_split": 20,
"criterion": "gini",
"min_samples_leaf": 11

As suggested by @dooms, I checked sys.getsizeof and it returns 64. I assume this is just the size of the pointer.
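`sys.getsizeof` reports only the shallow size of the object itself and does not follow references to the data it holds, which is consistent with getting a small constant for a whole forest. A minimal sketch of that behavior, where `Container` is a hypothetical stand-in for an estimator and not part of sklearn:

```python
import pickle
import sys


class Container:
    """Hypothetical stand-in for an estimator that holds its data by reference."""

    def __init__(self, n):
        self.payload = list(range(n))


small = Container(10)
large = Container(1_000_000)

# Shallow sizes: the referenced list is not counted, so both are identical.
shallow_small = sys.getsizeof(small)
shallow_large = sys.getsizeof(large)

# Serializing the referenced data gives a better idea of what ends up on disk.
payload_bytes = len(pickle.dumps(large.payload, protocol=pickle.HIGHEST_PROTOCOL))

print("shallow sizes:", shallow_small, shallow_large)
print("pickled payload of the large object:", payload_bytes, "bytes")
```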

I tried another way to save the model:

# Import path for older scikit-learn releases; newer releases use "import joblib" directly.
from sklearn.externals import joblib
joblib.dump(RF_model, 'filename.pkl')

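With this approach I get 1 *.pkl file and 201 *.npy files with a total size of 14.9 MB, which is smaller than the previous 53 MB. There is a pattern in these 201 npy files: there are 4 files for each tree in the forest.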

First file (231 KB) content:

array([(1, 1062, 20, 0.2557438611984253, 0.4997574055554296, 29168, 46216.0),
       (2, 581, 12, 0.5557271242141724, 0.49938159451291675, 7506, 11971.0),
       (3, 6, 14, 0.006186043843626976, 0.4953095968671224, 4060, 6422.0),
       ...,
       (4123, 4124, 15, 0.6142271757125854, 0.4152249134948097, 31, 51.0),
       (-1, -1, -2, -2.0, 0.495, 11, 20.0),
       (-1, -1, -2, -2.0, 0.3121748178980229, 20, 31.0)], 
      dtype=[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')])

Second file:

array([[[  2.25990000e+04,   2.36170000e+04]],

       [[  6.19600000e+03,   5.77500000e+03]],

       [[  3.52200000e+03,   2.90000000e+03]],

       ..., 
       [[  3.60000000e+01,   1.50000000e+01]],

       [[  1.10000000e+01,   9.00000000e+00]],

       [[  2.50000000e+01,   6.00000000e+00]]])

Third file (88 B):

array([2])

Last file in the group (96 B):

array([ 0.,  1.])

Any idea what these are? I tried to look into the tree code in sklearn, but it is hard to follow. Is there any way to save an sklearn tree so that it takes up less disk space? (Just to point out, an xgboost ensemble of similar size took about 200 KB in total.)

What are the parameters of the classifier? The number of trees and the max depth / min_samples_{split,leaf} are relevant.
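The 231 KB size of the first file can be checked against the dtype shown above: each row is one tree node made of seven 8-byte fields. The following is a rough back-of-the-envelope sketch using only the standard library. It assumes that the highest child index visible in the dump (4124) belongs to the last node of that tree, and it ignores the small .npy header.

```python
import struct

# One node record, following the dtype in the dump:
# left_child, right_child, feature (int64), threshold, impurity (float64),
# n_node_samples (int64), weighted_n_node_samples (float64).
NODE_FORMAT = "<qqqddqd"
bytes_per_node = struct.calcsize(NODE_FORMAT)
print(bytes_per_node)  # 56

# Assumption: node indices run from 0 to 4124, the highest index shown above.
n_nodes = 4124 + 1
node_array_bytes = n_nodes * bytes_per_node
print(node_array_bytes)  # 231000
```

Under that assumption the node array alone accounts for about 231 KB in this tree, so the on-disk size appears to scale directly with the number of nodes per tree and the number of trees. Settings that limit tree growth, such as the `min_samples_leaf` and `min_samples_split` values listed in the question, would therefore be the natural levers for a smaller model.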