How does predict.lm() compute confidence intervals and prediction intervals?


I ran a regression:

CopierDataRegression <- lm(V1~V2, data=CopierData1)

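I got (87.3, 91.9) and (74.5, 104.8), which seems right, since the PI should be wider.

Both outputs also include the same se.fit = 1.39. I don't understand what this standard error is. Shouldn't the standard error be larger for the PI than for the CI? How do I find these two different standard errors in R?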

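Data:

CopierData1

I don't know of a quick way to extract the standard error of the prediction interval, but you can always back-solve the interval for the SE (though this is not a very elegant approach).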
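A minimal sketch of that back-solving, under stated assumptions: CopierData1 is not reproduced here, so the built-in `cars` data stands in for it, and the names `fit`, `new`, `cint` and `pint` are made up for illustration. Both intervals have the form fit ± t-quantile × SE, so the SE can be recovered from the lower bound.

```r
## Stand-in model: built-in `cars` data (an assumption, not the original data)
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)
level <- 0.90

cint <- predict(fit, new, se.fit = TRUE, interval = "confidence", level = level)
pint <- predict(fit, new, se.fit = TRUE, interval = "prediction", level = level)

## both intervals use the same t quantile on the residual degrees of freedom
tq <- qt(1 - (1 - level) / 2, df = cint$df)

## interval = fit -/+ tq * SE, hence SE = (fit - lwr) / tq
se_ci <- (cint$fit[, "fit"] - cint$fit[, "lwr"]) / tq
se_pi <- (pint$fit[, "fit"] - pint$fit[, "lwr"]) / tq

se_ci  ## same value as cint$se.fit
se_pi  ## larger, because it also contains the residual variance
```

Note that `se.fit` is identical in both calls: it is always the standard error of the predicted mean, whichever interval is requested.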
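When the interval and level arguments are specified, predict.lm can return a confidence interval (CI) or a prediction interval (PI). This answer shows how to obtain the CI and PI without setting these arguments. There are two ways:

Use the intermediate-stage results from predict.lm
Do everything from scratch

Knowing how to work with both ways gives you a thorough understanding of the prediction procedure.
Note that we only cover the type = "response" (default) case of predict.lm. Discussion of type = "terms" is beyond the scope of this answer.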

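Setup
I gather your code here so that other readers can copy, paste and run it. I have also changed the variable names so that they have clearer meanings. In addition, I have expanded newdat to include more than one row, to show that our computations are "vectorized".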
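The original data are not reproduced on this page, so the following is only a stand-in setup with simulated data. The object names `lmObject` and `newdat` and the variables `V1` and `V2` follow the text; the simulated values, the seed and the two new `V2` values are assumptions, so the numbers it produces will not match the outputs quoted in this answer.

```r
## Simulated stand-in for CopierData1 (made-up values, not the original data)
set.seed(1)
dat <- data.frame(V2 = sample(1:10, 45, replace = TRUE))
dat$V1 <- 15 * dat$V2 + rnorm(45, sd = 9)

## fitted model and new data with more than one row
lmObject <- lm(V1 ~ V2, data = dat)
newdat <- data.frame(V2 = c(6, 7))  ## hypothetical new values
```

With 45 rows and two coefficients, the residual degrees of freedom come out as 43, the same as the `$df` shown in the outputs.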
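Using the intermediate-stage results from predict.lm
Calling predict.lm with se.fit = TRUE returns a list z with components fit, se.fit, df and residual.scale. The CI is then z$fit plus or minus the t quantile on z$df degrees of freedom times z$se.fit, and we see that this agrees with predict.lm(, interval = "confidence").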
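What is the standard error for the PI?
The PI is wider than the CI because it accounts for the residual variance:
variance_of_PI = variance_of_CI + variance_of_residual

Note that this is defined point-wise. For a non-weighted linear regression (as in your example), the residual variance is equal everywhere (known as homoscedasticity), and it is z$residual.scale ^ 2. Thus the standard error for the PI is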
se.PI <- sqrt(z$se.fit ^ 2 + z$residual.scale ^ 2)
#       1        2 
#9.022228 9.058082 
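
As a worked check for the first row, the same computation can be written out by hand. The symbols are introduced here only for illustration: $\hat{y}_0$ stands for z$fit, $\mathrm{se}_{\mathrm{fit}}$ for z$se.fit, $\hat{\sigma}$ for z$residual.scale, and $t_{0.95,\,43}$ for the upper 5% quantile of the t distribution on 43 degrees of freedom (1.681071 in the output).

```latex
\mathrm{se}_{\mathrm{PI}}
  = \sqrt{\mathrm{se}_{\mathrm{fit}}^{2} + \hat{\sigma}^{2}}
  = \sqrt{1.396411^{2} + 8.913508^{2}}
  = \sqrt{1.949963 + 79.45063}
  \approx 9.022228

\hat{y}_0 \pm t_{0.95,\,43}\,\mathrm{se}_{\mathrm{PI}}
  = 89.63133 \pm 1.681071 \times 9.022228
  \approx (74.4643,\ 104.7983)
```

These bounds match the first row returned by predict.lm(, interval = "prediction").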

Note that the construction of the CI is not affected by the type of regression.

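Do everything from scratch
Basically we want to know how to obtain fit, se.fit, df and residual.scale in z.

The predicted mean can be computed by the matrix-vector multiplication Xp %*% b, where Xp is the linear predictor matrix and b is the vector of regression coefficients.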
Xp <- model.matrix(delete.response(terms(lmObject)), newdat)
b <- coef(lmObject)
yh <- c(Xp %*% b)  ## c() reshape the single-column matrix to a vector
#[1]  89.63133 104.66658

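We don't need the full variance-covariance matrix of yh to compute a point-wise CI or PI; we only need its main diagonal. So, with V the variance-covariance matrix of b given by vcov(lmObject), instead of doing diag(Xp %*% V %*% t(Xp)) we can do it more efficiently via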
var.fit <- rowSums((Xp %*% V) * Xp)  ## point-wise variance for predicted mean
#       1        2 
#1.949963 2.598222 

sqrt(var.fit)  ## this agrees with `z$se.fit`
#       1        2 
#1.396411 1.611900 
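
A quick way to convince yourself that the rowSums shortcut really gives the diagonal is to compute both and compare. This is only a sketch on the built-in `cars` data, used as a stand-in; the names `fit`, `new`, `full` and `fast` are made up for illustration.

```r
## Stand-in model: built-in `cars` data (an assumption, not the original data)
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 15, 20))

Xp <- model.matrix(delete.response(terms(fit)), new)  ## prediction matrix
V <- vcov(fit)                                        ## covariance of coefficients

full <- diag(Xp %*% V %*% t(Xp))  ## builds the whole n x n matrix first
fast <- rowSums((Xp %*% V) * Xp)  ## only ever computes the diagonal

## both agree, and their square root is the se.fit reported by predict.lm
stopifnot(isTRUE(all.equal(full, fast, check.attributes = FALSE)))
stopifnot(isTRUE(all.equal(sqrt(fast), predict(fit, new, se.fit = TRUE)$se.fit,
                           check.attributes = FALSE)))
```

The shortcut matters when newdat has many rows: the full product is n x n, while the row-wise form only needs n x p intermediate results.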


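Appendix: mimicking predict.lm
The code in "Do everything from scratch" has been neatly organized into a function in a separate Q&A.

Look at ?predict.lm, which says: "se.fit: standard error of predicted means". "Predicted means" sounds like it applies only to the confidence interval. If you don't want to see it, set se.fit = FALSE.

Thanks. I think what I'm asking is: how are the two standard errors in the picture computed? Then I can verify the calculations and know how they are derived.

This works. I back-solved for the SE using 89.63 ± t(0.95, 43) × SE = lower bound, where the lower bound of the CI is 87.28 and that of the PI is 74.46. The SE for the CI is 1.39 and the SE for the PI is 9.02. So the SE of the prediction interval is larger than that of the confidence interval. But I still don't understand why the R output for the prediction interval lists se.fit = 1.39. Why doesn't it list 9? Thanks.

Simple is elegant... and it is also a good way to practice basic understanding.
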
predict(lmObject, newdat, se.fit = TRUE, interval = "confidence", level = 0.90)
#$fit
#        fit       lwr      upr
#1  89.63133  87.28387  91.9788
#2 104.66658 101.95686 107.3763
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

predict(lmObject, newdat, se.fit = TRUE, interval = "prediction", level = 0.90)
#$fit
#        fit      lwr      upr
#1  89.63133 74.46433 104.7983
#2 104.66658 89.43930 119.8939
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

## use `se.fit = TRUE`
z <- predict(lmObject, newdat, se.fit = TRUE)
#$fit
#        1         2 
# 89.63133 104.66658 
#
#$se.fit
#       1        2 
#1.396411 1.611900 
#
#$df
#[1] 43
#
#$residual.scale
#[1] 8.913508

alpha <- 0.90  ## 90%
Qt <- c(-1, 1) * qt((1 - alpha) / 2, z$df, lower.tail = FALSE)
#[1] -1.681071  1.681071

## 90% confidence interval
CI <- z$fit + outer(z$se.fit, Qt)
colnames(CI) <- c("lwr", "upr")
CI
#        lwr      upr
#1  87.28387  91.9788
#2 101.95686 107.3763

PI <- z$fit + outer(se.PI, Qt)
colnames(PI) <- c("lwr", "upr")
PI
#       lwr      upr
#1 74.46433 104.7983
#2 89.43930 119.8939

Quoted from ?predict.lm:

 The prediction intervals are for a single observation at each case
 in ‘newdata’ (or by default, the data used for the fit) with error
 variance(s) ‘pred.var’.  This can be a multiple of ‘res.var’, the
 estimated value of sigma^2: the default is to assume that future
 observations have the same error variance as those used for
 fitting.  If ‘weights’ is supplied, the inverse of this is used as
 a scale factor.  For a weighted fit, if the prediction is for the
 original data frame, ‘weights’ defaults to the weights used for
 the model fit, with a warning since it might not be the intended
 result.  If the fit was weighted and ‘newdata’ is given, the
 default is to assume constant prediction variance, with a warning.

V <- vcov(lmObject)  ## use `vcov` function in R
#             (Intercept)         V2
# (Intercept)    7.862086 -1.1927966
# V2            -1.192797  0.2333733

dof <- df.residual(lmObject)
#[1] 43

sig2 <- c(crossprod(lmObject$residuals)) / dof
# [1] 79.45063

sqrt(sig2)  ## this agrees with `z$residual.scale`
#[1] 8.913508

## for a weighted regression, weight the residuals before forming the sum of squares
sig2 <- c(crossprod(sqrt(lmObject$weights) * lmObject$residuals)) / dof
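
To see the weighted formula in action, here is a small sketch. The data (`cars`) and the weights `w` are made up for illustration and are not part of the original example; the check compares the from-scratch value with the residual standard error that summary() reports for the same fit.

```r
## Stand-in data and arbitrary positive weights (assumptions for illustration)
set.seed(42)
w <- runif(nrow(cars), min = 0.5, max = 2)
wfit <- lm(dist ~ speed, data = cars, weights = w)

dof <- df.residual(wfit)

## weighted residual sum of squares divided by the residual degrees of freedom
sig2 <- c(crossprod(sqrt(wfit$weights) * wfit$residuals)) / dof

sqrt(sig2)           ## from scratch
summary(wfit)$sigma  ## residual standard error reported by summary.lm
```

The two values should agree; with an unweighted fit, lmObject$weights is NULL, which is why the plain crossprod of the residuals is used in that case.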