Understanding warm start in sklearn's MLP (Python)
Please note: Python version: 3.5.0; sklearn version: 0.20.3
I have a model built with the sklearn package that I am using, and it is getting quite good results.
The code I am running is as follows:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn import preprocessing
import pandas as pd, numpy as np
import sklearn
def compare_values(arr1, arr2):
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediff = abs(thing1 - thing2)
        thediffs.append(thediff)
    return thediffs

def robustscale(data):
    scaler = RobustScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled
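As an aside, taking `np.average` of `compare_values(y_test, preds)` is just the mean absolute error, which sklearn already provides as `sklearn.metrics.mean_absolute_error` (the sample values below are mine, for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = [3.0, 1.0, 4.0]
y_pred = [2.5, 2.0, 4.0]

# Same value as np.average([abs(a - b) for a, b in zip(y_true, y_pred)])
manual = np.average(np.abs(np.subtract(y_true, y_pred)))
print(manual, mean_absolute_error(y_true, y_pred))  # both 0.5
```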
total_avgs = []
def driver(data, labels, model, scaling):
    best_model = None
    best = 1000000
    avgs = []
    for x in range(5):
        X_train, X_test, y_train, y_test = train_test_split(data, labels, shuffle=True, test_size=0.2)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        differences = np.average(compare_values(y_test, preds))
        print("CURRENT MODEL Average: {}".format(differences))
        if differences < best:  # lower average error is better
            best = differences
            best_model = model
        avgs.append(differences)
        total_avgs.append(differences)
    print("Average Performance Overall: {}".format(np.average(avgs)))
    print("Best Performance Overall: {}".format(np.min(avgs)))
data = pd.read_csv('new.csv')
# handle some data manipulation. Dropping columns and such. Nothing important
# (`labels` is also extracted from the data here)
rb_data = robustscale(data)
mlp = MLPRegressor(
    activation='tanh',
    hidden_layer_sizes=(1000, 1000, 1000),
    alpha=0.009,
    learning_rate='invscaling',
    learning_rate_init=0.01,
    max_iter=200,
    momentum=0.9,
    solver='lbfgs',
    warm_start=False
)
print("############################################")
print("NOW TESTING ROBUST SCALE DATA: ")
driver(rb_data, labels, mlp, "rb")
print("############################################")
print("\n")
print("BEST MODEL PERFORMANCE: {}".format(np.min(total_avgs)))
I am trying to understand why I am getting such good results on this regression problem. My MLP is configured as shown above (the parameters were chosen after tuning). (Yes, I also find it odd that relu was not selected, but it never was.)

With warm_start=True set, I get output like this:
############################################
NOW TESTING ROBUST SCALE DATA:
CURRENT MODEL Average: 21.163831505120193
CURRENT MODEL Average: 12.44361687293673
CURRENT MODEL Average: 5.687720697116947
CURRENT MODEL Average: 4.225979713815092
CURRENT MODEL Average: 5.235999000929669
Average Performance Overall: 9.751429557983725
Best Performance Overall: 4.225979713815092
############################################
Clearly, the performance gets better with every run. However, when I set warm_start=False, I get:
############################################
NOW TESTING ROBUST SCALE DATA:
CURRENT MODEL Average: 25.221720858740714
CURRENT MODEL Average: 20.3609370299473
CURRENT MODEL Average: 23.385534335200845
CURRENT MODEL Average: 21.89668702232435
CURRENT MODEL Average: 15.38606220618026
Average Performance Overall: 21.250188290478693
Best Performance Overall: 15.38606220618026
############################################
Clearly, warm_start=True is affecting performance in a positive way. But how? On every pass through the loop I randomly re-split the data, create what I thought was a brand-new model, and run the test. How is the new model learning from the old one?

The simple explanation is that your model has already "seen" the data you test on in each loop and has a "memory" of it. In other words, when you use warm start, your test data is no longer independent of your training data, and that is why you get unrealistically good results. If you are trying a cross-validation setup, you should not use warm start. The test data should also be kept out of the split and the scaling: scaling the entire dataset before splitting and training produces a similar effect of "leaking" data between the training and test portions. See here:
From the docs: you are creating a new model, but you are telling the regressor to "reuse the solution of the previous call to fit as initialization".

Sorry, but the clarification you are asking for is not entirely clear to me. I have read the documentation too. I think what I do not understand is that quoted sentence: how does a new model know about a previous instantiation? Hypothetically, if I ran the loop 1000 times, wouldn't my data just be overfit rather than tuned? I guess I do not understand how a new model learns from the previous one, and how that turns into overfitting.
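To see what "reuse the solution of the previous call to fit as initialization" means concretely, you can watch the weights: with warm_start=True the second `fit` call starts from the `coefs_` the first call ended with, while with warm_start=False every `fit` re-initializes them. A minimal sketch on synthetic data (my own toy setup, using the default adam solver and a tiny network for speed; the ConvergenceWarning from the small max_iter is expected and harmless):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X.sum(axis=1)

# warm_start=True: the second .fit() continues from the weights the
# first .fit() ended with, instead of re-initializing them
warm = MLPRegressor(hidden_layer_sizes=(5,), max_iter=20,
                    warm_start=True, random_state=0)
warm.fit(X, y)
first_weights = [c.copy() for c in warm.coefs_]
warm.fit(X, y)  # training resumes, so the weights keep moving

# warm_start=False: every .fit() re-initializes the weights, so two
# identical fits (same data, same random_state) end at the same place
cold = MLPRegressor(hidden_layer_sizes=(5,), max_iter=20,
                    warm_start=False, random_state=0)
cold.fit(X, y)
cold_first = [c.copy() for c in cold.coefs_]
cold.fit(X, y)
```

This is also why looping `fit` with warm_start=True on re-shuffled splits of the same dataset drifts toward memorizing it: every "test" split has already been trained on in earlier iterations.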