R: Why does calculating the MSE in lasso regression give different outputs?

r machine-learning

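I am trying to run different regression models on the Prostate cancer data from the lasso2 package. When I use lasso, I have seen two different ways to calculate the mean squared error. But they give me quite different results, so I would like to know whether I am doing something wrong, or whether it just means that one method is better than the other.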

# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min

# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest) 

# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

Here are the outputs I get for the two MSEs:

MSE (method 1): 0.4609978 
MSE (method 2): 0.5654089

They are quite different. Does anyone know why? Thanks a lot for your help.

Samuel

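As @alistaire pointed out, in the first case you use the test data to compute the MSE, while in the second case the MSE reported is the one from the cross-validation folds of the training data, so it is not an apples-to-apples comparison.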

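We can do something like the following to make an apples-to-apples comparison (by keeping the fitted values on the training folds). We can then see that mse.1 and mse.2 are exactly equal when computed on the same training folds (although the value differs slightly from yours; my desktop runs R version 3.1.2, x86_64-w64-mingw32, Windows 10):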

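Comments:

If I'm reading this correctly, mse.1 is the test MSE, and mse.2 is the cross-validation error of the selected model, but based only on the training data.

Thanks for pointing that out. So the correct sequence is to run cv.glmnet on the training data to get the best lambda, and then compute the MSE using method 1? I assume it makes no sense to run cv.glmnet again on the test data to get the cvm (mean cross-validated error)? Sorry, I'm a bit confused.

You use cross-validation to estimate the test error so that you can decide which lambda value to use on the test data. You should only touch the test data once.

@alistaire Thanks! So I can drop the second method, since it does not really make sense here.

Thanks for your answer @sandipan. So cv.glmnet uses 10 folds by default. Does specifying a larger number of folds mean the cross-validation result will be better?

A higher number of folds may lead to overfitting, since there is more data to train on.

OK, so it is better to keep the default K=10? Or is there a way to find the optimal number of folds?

Take a look at this

Thanks for the link!

Code from the answer:
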
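The sequence discussed in the comments (choose lambda by cross-validation on the training data, then evaluate once on the test data) can be sketched roughly as below. This is only an illustration under stated assumptions: the data are simulated rather than the Prostate set, and the names cvfit, cv.mse, test.mse and fit.k are hypothetical. The nfolds argument of cv.glmnet is used at the end to show how the cross-validated estimate moves with the number of folds; the sketch does not claim that any particular value is best.

```r
library(glmnet)

set.seed(1)
# Simulated data (an assumption for illustration; not the Prostate data)
n <- 200; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% c(2, -1, 0.5, rep(0, p - 3)) + rnorm(n))

# 75% / 25% train-test split, as in the question
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
xtrain <- x[train_ind, ];  ytrain <- y[train_ind]
xtest  <- x[-train_ind, ]; ytest  <- y[-train_ind]

# Step 1: choose lambda using the training data only
cvfit <- cv.glmnet(xtrain, ytrain, type.measure = "mse", alpha = 1, nfolds = 10)

# Cross-validated error estimate at the chosen lambda (training data only)
cv.mse <- cvfit$cvm[cvfit$lambda == cvfit$lambda.min]

# Step 2: evaluate once on the held-out test set
test.pred <- predict(cvfit, newx = xtest, s = "lambda.min")
test.mse <- mean((as.numeric(test.pred) - ytest)^2)

cat("CV MSE (training data):", cv.mse, "\n")
cat("Test MSE:", test.mse, "\n")

# Optional: see how the CV estimate changes with the number of folds
for (k in c(5, 10, 20)) {
  fit.k <- cv.glmnet(xtrain, ytrain, type.measure = "mse", alpha = 1, nfolds = k)
  cat("nfolds =", k, " min CV MSE:", min(fit.k$cvm), "\n")
}
```

The two printed numbers estimate the same quantity from different data, so they will generally differ, just as in the question.
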
# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
# keep the fitted values on the training folds
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)

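# get the predicted values on the training folds with lambda.min (not from test data);
# keep=TRUE stores them in fit.preval (the original pr.lasso$fit relied on partial matching)
mse.1 = mse(pr.lasso$fit.preval[,lambda.id], ytrain) 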
cat("MSE (method 1): ", mse.1, "\n")

MSE (method 1):  0.6044496 

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

MSE (method 2):  0.6044496 

mse.1 == mse.2
[1] TRUE
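
Why the two numbers agree can be illustrated without glmnet. The base R sketch below runs K-fold cross-validation by hand on simulated data with lm (both are assumptions made for illustration; the names preval, fold.mse, pooled.mse and weighted.mse are hypothetical). Each observation is predicted by a model that did not see it, and the pooled MSE of those held-out predictions equals the fold-size-weighted mean of the per-fold MSEs. As far as I understand it, that is the relationship between the fitted values kept by keep=TRUE and cvm, but treat it as a sketch rather than a description of glmnet's internals.

```r
set.seed(42)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

k <- 10
# Random fold assignment, one fold label per observation
foldid <- sample(rep(seq_len(k), length.out = n))

preval <- numeric(n)    # held-out prediction for every observation
fold.mse <- numeric(k)  # MSE within each held-out fold
for (i in seq_len(k)) {
  held.out <- foldid == i
  fit <- lm(y ~ x1 + x2, data = dat[!held.out, ])
  preval[held.out] <- predict(fit, newdata = dat[held.out, ])
  fold.mse[i] <- mean((dat$y[held.out] - preval[held.out])^2)
}

# Pooled MSE over all held-out predictions
pooled.mse <- mean((dat$y - preval)^2)
# Mean of the per-fold MSEs, weighted by fold size
weighted.mse <- weighted.mean(fold.mse, w = tabulate(foldid, nbins = k))

# Compare with a tolerance rather than ==, since these are floating-point values
print(all.equal(pooled.mse, weighted.mse))
```

Comparing with all.equal is a little safer than the exact == used above, because two mathematically equal results can differ in the last bits depending on the order of the arithmetic.
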