How to get a cross-validated R-squared from a linear model in R?
I have a linear model in R:
set.seed(1234)
x <- rnorm(100)
z <- rnorm(100)
y <- rnorm(100, x+z)
mydata <- data.frame(x,y,z)
fit <- lm(y ~ x + z, mydata)
So the following is a slight adaptation. Basically, I modified that example so that it is a function:
library(bootstrap)
k_fold_rsq <- function(lmfit, ngroup = 10) {
  # assumes library(bootstrap)
  # adapted from http://www.statmethods.net/stats/regression.html
  mydata <- lmfit$model
  outcome <- names(lmfit$model)[1]
  predictors <- names(lmfit$model)[-1]

  theta.fit <- function(x, y) {lsfit(x, y)}
  theta.predict <- function(fit, x) {cbind(1, x) %*% fit$coef}
  X <- as.matrix(mydata[predictors])
  y <- as.matrix(mydata[outcome])

  results <- crossval(X, y, theta.fit, theta.predict, ngroup = ngroup)
  raw_rsq <- cor(y, lmfit$fitted.values)^2 # raw R2
  cv_rsq <- cor(y, results$cv.fit)^2       # cross-validated R2

  c(raw_rsq = raw_rsq, cv_rsq = cv_rsq)
}
Warning: While raw_rsq is clearly correct, and cv_rsq is in the ballpark I expected, note that I haven't checked exactly what the crossval function does. So use at your own risk, and if anyone has any feedback, it would be most welcome. It is also only designed for linear models with an intercept and standard main-effect notation.

I wrote a function for this. It also works with nominal predictors. It only works for lm objects (I think), but could easily be extended to glm and so on.
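As a hedged sketch of that glm extension (boot::cv.glm is a swapped-in technique here, not part of either function above), the boot package can estimate the k-fold CV prediction error directly, from which a CV R^2 follows:

```r
library(boot)

set.seed(1234)
x <- rnorm(100); z <- rnorm(100); y <- rnorm(100, x + z)
mydata <- data.frame(x, y, z)

# a gaussian glm is equivalent to lm for this model
gfit <- glm(y ~ x + z, data = mydata)

# delta[1] is the raw k-fold CV estimate of the prediction MSE
cv_mse <- cv.glm(mydata, gfit, K = 10)$delta[1]

# convert the CV MSE to a CV R^2 against the outcome variance
cv_rsq <- 1 - cv_mse / mean((y - mean(y))^2)
round(c(cv_mse = cv_mse, cv_rsq = cv_rsq), 3)
```

Since cv.glm refits from the model formula, this route also tolerates factor predictors.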
# from
# http://stackoverflow.com/a/16030020/3980197
# via http://www.statmethods.net/stats/regression.html
#' Calculate k fold cross validated r2
#'
#' Using k fold cross-validation, estimate the true r2 in a new sample. This is better than using adjusted r2 values.
#' @param lmfit (an lm fit) An lm fit object.
#' @param folds (whole number scalar) The number of folds to use (default 10).
#' @param runs (whole number scalar) The number of repeated CV runs to average over (default 100).
#' @param seed (whole number scalar) Random seed for reproducibility; use NA to skip seeding.
#' @export
#' @examples
#' fit = lm("Petal.Length ~ Sepal.Length", data = iris)
#' MOD_k_fold_r2(fit)
MOD_k_fold_r2 = function(lmfit, folds = 10, runs = 100, seed = 1) {
  library(magrittr)
  #get data
  data = lmfit$model

  #seed
  if (!is.na(seed)) set.seed(seed)

  v_runs = sapply(1:runs, FUN = function(run) {
    #Randomly shuffle the data
    data2 = data[sample(nrow(data)), ]
    #Create n equally sized folds
    folds_idx <- cut(seq(1, nrow(data2)), breaks = folds, labels = FALSE)

    #Perform n fold cross validation
    sapply(1:folds, function(i) {
      #Segment your data by fold using the which() function
      test_idx = which(folds_idx == i, arr.ind = TRUE)
      test_data = data2[test_idx, ]
      train_data = data2[-test_idx, ]

      #weights: reuse the original model's weights if present, else unit weights
      if ("(weights)" %in% names(train_data)) {
        train_data$.weights = train_data[["(weights)"]]
      } else {
        train_data$.weights = rep(1, nrow(train_data))
      }

      #fit
      fit = lm(formula = lmfit$call$formula, data = train_data, weights = .weights)

      #predict
      preds = predict(fit, newdata = test_data)

      #correlate to get r2
      cor(preds, test_data[[1]], use = "p")^2
    }) %>%
      mean()
  })

  #return
  c("raw_r2" = summary(lmfit)$r.squared, "cv_r2" = mean(v_runs))
}
Applied to the OP's sample:

MOD_k_fold_r2(fit)
#> raw_r2  cv_r2
#>  0.724  0.718
Discussions on stats.stackexchange (such as this and this) argue that the mean squared error (MSE) should be used rather than R^2.
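To illustrate the distinction (a minimal sketch of my own, not code from the linked discussions), both quantities can be computed from the same fold-wise held-out predictions:

```r
set.seed(1234)
x <- rnorm(100); z <- rnorm(100); y <- rnorm(100, x + z)
mydata <- data.frame(x, y, z)

# random 10-fold assignment
fold <- sample(rep(1:10, length.out = nrow(mydata)))

# collect the held-out prediction for every row
preds <- numeric(nrow(mydata))
for (i in 1:10) {
  test <- which(fold == i)
  f <- lm(y ~ x + z, data = mydata[-test, ])
  preds[test] <- predict(f, newdata = mydata[test, ])
}

cv_mse <- mean((y - preds)^2)   # CV mean squared error
cv_rsq <- cor(y, preds)^2       # CV R^2 (squared correlation)
round(c(cv_mse = cv_mse, cv_rsq = cv_rsq), 3)
```

The MSE penalizes miscalibrated predictions, while the squared correlation only measures how well the predictions track the outcome.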
Leave-one-out cross-validation (the special case of k-fold CV where k = N) has the property that the CV MSE of a linear model can be computed quickly with a simple formula; see section 5.1.2 of "An Introduction to Statistical Learning with Applications in R". The code at the end of this post should compute the RMSE value of an lm model (using equation 5.2 from the same section), which you can compare with the "regular" RMSE from summary(fit)$sigma.
Or the RMSE obtained from 5- or 10-fold cross-validation, I imagine. Possibly off-topic... Fine, but why? It's about how to implement a statistical technique in this language, which has nearly 30,000 questions. If you prefer, I could remove the statistical element of the question and focus only on the R implementation? See @NPE, thanks very much; it looks like that will work. Once I've applied it, I'll post an example of how it applies to the example above. @JeromyAnglim: if you search this site for DAAG, you'll find some related questions and answers. This function breaks for models with factor predictors. For example: fit = lm("Sepal.Length ~ Species", data = iris); k_fold_rsq(fit) gives: Error in lsfit(x, y): NA/NaN/Inf in 'x'. In addition: Warning message: In lsfit(x, y): NAs introduced by coercion. I'm also not sure how to achieve this with interactions.
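A hedged sketch of why refitting from the model formula sidesteps the lsfit() factor problem (the single train/test split and the interaction formula here are illustrative assumptions, not code from either answer): lm() rebuilds the design matrix, dummy and interaction columns included, on each training set, so factors and interactions work transparently:

```r
set.seed(1)
fit_int <- lm(Petal.Length ~ Sepal.Length * Species, data = iris)

# hold out a third of the rows, refit the same formula on the rest
test <- sample(nrow(iris), 50)
refit <- lm(formula(fit_int), data = iris[-test, ])
preds <- predict(refit, newdata = iris[test, ])

# held-out R^2
round(cor(preds, iris$Petal.Length[test])^2, 3)
```

Doing this per fold, as the formula-based function below does, gives the cross-validated estimate.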
The first function, applied to the OP's data, returns:

  raw_rsq    cv_rsq
0.7237907 0.7050297
With a factor predictor, the formula-based function works where the crossval-based one fails:

fit = lm("Petal.Length ~ Species", data = iris)
MOD_k_fold_r2(fit)
#> raw_r2 cv_r2
#> 0.9413717 0.9398156
# LOOCV RMSE via the shortcut formula (ISLR eq. 5.2)
sqrt(sum((residuals(fit)/(1 - hatvalues(fit)))^2)/length(fit$residuals))
# the "regular" in-sample RMSE, for comparison
summary(fit)$sigma
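As a hedged check (my own verification, not part of the original answer), the leverage-based shortcut agrees exactly with explicit leave-one-out refitting, since it is the PRESS identity for linear models:

```r
set.seed(1234)
x <- rnorm(100); z <- rnorm(100); y <- rnorm(100, x + z)
mydata <- data.frame(x, y, z)
fit <- lm(y ~ x + z, data = mydata)

# shortcut: residuals deflated by their hat values (ISLR eq. 5.2)
loocv_rmse_fast <- sqrt(mean((residuals(fit) / (1 - hatvalues(fit)))^2))

# explicit LOOCV: refit n times, predict the single held-out row
errs <- sapply(seq_len(nrow(mydata)), function(i) {
  f <- lm(y ~ x + z, data = mydata[-i, ])
  mydata$y[i] - predict(f, newdata = mydata[i, ])
})
loocv_rmse_slow <- sqrt(mean(errs^2))

all.equal(loocv_rmse_fast, loocv_rmse_slow)  # TRUE
```

The shortcut needs only one fit, so it scales to large n where the explicit loop would be costly.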