How to retrain a logistic regression model in sklearn with new data

python machine-learning scikit-learn

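How do I retrain an existing machine learning model in sklearn (Python)?

I have several thousand records that I used to train my model, which I then dumped to a .pkl file using pickle. When training the model for the first time, I passed the warm_start=True parameter when creating the LogisticRegression object.

Sample code: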

    import pickle
    from sklearn import linear_model

    log_regression_model = linear_model.LogisticRegression(warm_start=True)
    log_regression_model.fit(X, Y)
    # Saved this model as a .pkl file on the filesystem, like pickle.dump(model, open('model.pkl', 'wb'))


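I want to keep the model updated with the new data I receive every day. To do that, I open the existing model file, take the new data from the last 24 hours, and train the model again.

Sample code: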


    # Open the model from the filesystem
    log_regression_model = pickle.load(open('model.pkl', 'rb'))
    # New X, Y here are only the data from the last 24 hours (a few hundred records).
    log_regression_model.fit(X, Y)

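However, when I retrain the model after loading it from the filesystem, it seems to discard the existing model built from thousands of records and create a new one from the few hundred records of the last 24 hours (the model trained on thousands of records is 3 MB on the filesystem, while the newly retrained model is only 67 KB).
I have tried using the warm_start option.
How can I retrain my LogisticRegression model?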
The size of a LogisticRegression object does not depend on the number of samples used to train it.

    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pickle
    import sys

    # Fit on 100,000 samples and print the size of the pickled model
    np.random.seed(0)
    X, y = np.random.randn(100000, 1), np.random.randint(2, size=(100000,))
    log_regression_model = LogisticRegression(warm_start=True)
    log_regression_model.fit(X, y)
    print(sys.getsizeof(pickle.dumps(log_regression_model)))

    # Fit on only 100 samples and print the size again
    np.random.seed(0)
    X, y = np.random.randn(100, 1), np.random.randint(2, size=(100,))
    log_regression_model = LogisticRegression(warm_start=True)
    log_regression_model.fit(X, y)
    print(sys.getsizeof(pickle.dumps(log_regression_model)))
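results in

    1230
    1233

You may have saved the wrong model object. Make sure you are saving log_regression_model:

    pickle.dump(log_regression_model, open('model.pkl', 'wb'))

Since the model sizes are so different, and the size of a LogisticRegression object does not change with the number of training samples, it looks like different code was used to generate your saved model and the new "retrained" model.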
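To illustrate why the sample count does not affect the object size, here is a minimal sketch. It assumes synthetic random data and a hypothetical feature count of 5; the helper fit_and_describe is made up for this example. It reports the shapes of the fitted coef_ and intercept_ arrays, which are determined by the number of features and classes, together with the pickled size.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_features = 5  # hypothetical feature count, chosen only for illustration


def fit_and_describe(n_samples):
    """Fit on n_samples random rows and report what the fitted model stores."""
    X = rng.randn(n_samples, n_features)
    y = rng.randint(2, size=n_samples)
    model = LogisticRegression().fit(X, y)
    # coef_ has one column per feature; intercept_ has one entry for a binary problem
    return model.coef_.shape, model.intercept_.shape, len(pickle.dumps(model))


small = fit_and_describe(200)
large = fit_and_describe(20000)
print(small)
print(large)
```

The two shape tuples should match, and the pickled sizes should differ by only a few bytes at most, so a jump from 3 MB to 67 KB points to a different kind of object being saved rather than to lost training data.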
All that said, it also looks like warm_start is not doing anything here:

    np.random.seed(0)
    X, y = np.random.randn(200, 1), np.random.randint(2, size=(200,))

    # warm_start=True: fit on the first half, then fit again on the second half
    log_regression_model = LogisticRegression(warm_start=True)
    log_regression_model.fit(X[:100], y[:100])
    print(log_regression_model.intercept_, log_regression_model.coef_)
    log_regression_model.fit(X[100:], y[100:])
    print(log_regression_model.intercept_, log_regression_model.coef_)

    # warm_start=False: fit on the second half only
    log_regression_model = LogisticRegression(warm_start=False)
    log_regression_model.fit(X[100:], y[100:])
    print(log_regression_model.intercept_, log_regression_model.coef_)

    # warm_start=False: fit on all of the data
    log_regression_model = LogisticRegression(warm_start=False)
    log_regression_model.fit(X, y)
    print(log_regression_model.intercept_, log_regression_model.coef_)
gives:

    (array([ 0.01846266]), array([[-0.32172516]]))
    (array([ 0.17253402]), array([[ 0.33734497]]))
    (array([ 0.17253402]), array([[ 0.33734497]]))
    (array([ 0.09707612]), array([[ 0.01501025]]))
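According to the scikit-learn documentation, warm_start has no effect with the liblinear solver, which was the default in older scikit-learn releases. It will have some effect if you use another solver (e.g. LogisticRegression(warm_start=True, solver='sag')), but that is still not the same as retraining on the entire dataset with the new data added. For example, the above four outputs become: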

    (array([ 0.01915884]), array([[-0.32176053]]))
    (array([ 0.17973458]), array([[ 0.33708208]]))
    (array([ 0.17968324]), array([[ 0.33707362]]))
    (array([ 0.09903978]), array([[ 0.01488605]]))

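You can see that the middle two lines differ, but not by much. All warm_start does is use the parameters of the previous model as the starting point for training a new model on the new data. It sounds like what you want to do is save the data and retrain on the combined old and new data every time you add data.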
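The suggestion of keeping the data and retraining on old plus new records could be sketched as follows. The arrays here are synthetic stand-ins (hypothetical shapes, random labels); in practice the stored history would be loaded from wherever it is kept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical stand-ins: "old" is the stored history, "new" is the last 24 hours.
X_old, y_old = rng.randn(5000, 3), rng.randint(2, size=5000)
X_new, y_new = rng.randn(300, 3), rng.randint(2, size=300)

# Append the new records to the stored history ...
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

# ... and fit a fresh model on everything.
model = LogisticRegression()
model.fit(X_all, y_all)

print(X_all.shape, model.coef_.shape)
```

The trade-off is that every update costs a full fit on the whole history, which may be too slow when the dataset is very large.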
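When you call fit on an already trained model, you basically discard all the previous information.

Some of scikit-learn's models have a partial_fit method that can be used for incremental training, as described in the documentation.

I don't remember whether logistic regression can be retrained in sklearn, but sklearn has SGDClassifier, which with loss='log' (named 'log_loss' in recent scikit-learn releases) runs logistic regression with stochastic gradient descent optimization, and it has a partial_fit method.

One question: can't you add the new data to the original data and retrain on the whole dataset?

As a side note, I would check the following link: . Then I would consider the mini-batch strategy often used with neural networks (you would need to implement gradient descent yourself), which is easy for logistic regression (check ). With this strategy, however, you need to make several passes over the whole dataset...

Training the model again on both the new and the old data is not efficient: the data volume is huge, and with the current resources training the model takes more than 24 hours,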
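The SGDClassifier and partial_fit suggestion from the comments above might look roughly like this. It is a sketch on synthetic data, not a drop-in replacement: the helper make_batch and the weights true_w are invented for the example, and the loss name is picked from the installed scikit-learn version because 'log' was renamed to 'log_loss' in release 1.1.

```python
import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier

# Logistic loss is called 'log_loss' from scikit-learn 1.1 on, 'log' before that.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
LOSS = "log_loss" if (major, minor) >= (1, 1) else "log"

rng = np.random.RandomState(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical weights used to generate labels


def make_batch(n_samples):
    """Generate a synthetic batch; stands in for one day's worth of records."""
    X = rng.randn(n_samples, 3)
    y = (X @ true_w + 0.1 * rng.randn(n_samples) > 0).astype(int)
    return X, y


model = SGDClassifier(loss=LOSS, random_state=0)

# First call: the full set of class labels has to be declared up front.
X_initial, y_initial = make_batch(5000)
model.partial_fit(X_initial, y_initial, classes=np.array([0, 1]))

# Later updates: each call continues from the current weights.
for _ in range(5):
    X_day, y_day = make_batch(300)
    model.partial_fit(X_day, y_day)

X_test, y_test = make_batch(1000)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Each partial_fit call makes a single pass over the batch it is given, so the model can be pickled after every update and reloaded the next day without revisiting the old data. SGD-based models tend to be sensitive to feature scaling, so real data would likely need to be standardized first.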