R random forest out-of-bag variable importance


I am trying to understand how %Var can exceed 100.

I am using this script:

require(randomForest)

start <- "B_fixed"
suffix <- ".txt"
dataDir <- "/Users/Desktop/"


mod1 <- read.table(paste(dataDir,start,suffix,sep=""),sep="\t",header=T)


form <- as.formula(Ksat_f~.)

Ksat_rf <- randomForest(form, data=mod1[c(1:14)],na.action=na.omit, ntree=1000, 
              replace=F,importance=T, do.trace=50, keep.forest=T,keep.inbag=T)
This uses 14 variables. If I use a single variable, %Var can reach 145%.

Any help is appreciated.

Thanks


-t

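A figure of 145% tells you that your model is far more wrong than right.

I admit the naming is a bit confusing. `%Var(y)` in the trace output is the out-of-bag mean squared error expressed as a percentage of the total variance of the target, whereas `% Var explained` in the printed summary is the percentage of target variance explained by the model. The two are complements of each other.

Note: 105.15% + (-5.15%) = 100%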
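The identity in the note can be checked with base R alone, without fitting a forest. This is a minimal sketch: `oob_style_summary` is a hypothetical helper (not part of randomForest), and it assumes the target variance is computed with divisor n, which is my understanding of how the package computes its `rsq` component.

```r
# Hypothetical helper: both figures for any vector of predictions.
oob_style_summary <- function(y, pred) {
  n <- length(y)
  mse <- mean((y - pred)^2)
  var_y <- var(y) * (n - 1) / n          # variance with divisor n (assumption)
  c(pct_var_y         = 100 * mse / var_y,        # error as % of target variance
    pct_var_explained = 100 * (1 - mse / var_y))  # complement of the above
}

set.seed(1)
y <- rnorm(1000)

# Predicting the overall mean for every observation: the reference point.
baseline <- oob_style_summary(y, rep(mean(y), length(y)))

# Predictions unrelated to y (a shuffled copy): worse than the mean.
worse <- oob_style_summary(y, sample(y))

print(baseline)
print(worse)
```

Whatever the predictions are, the two numbers sum to 100, so an error above 100% and a negative explained variance are the same statement: the predictions are worse than always guessing the mean.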

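In the reproducible example below, I shuffle (permute) the target `y`, so the RF model has no chance of predicting it. You will see that it performs very poorly: the error exceeds 100% and the explained variance falls below 0%. At exactly 0% explained variance, your model is as accurate as predicting the overall mean for every observation.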

set.seed(1)

library(randomForest)
X <- data.frame(replicate(5,rnorm(1000)))
y <- apply(X,1,sum)
y <- sample(y)
Data <- data.frame(X,y)

form <- as.formula(y~.)

rf <- randomForest(form, data=Data,na.action=na.omit,
                   ntree=1000,replace=F,importance=T,
                   do.trace=50, keep.forest=T,keep.inbag=T)

     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
  50 |     5.81   108.91 |
 100 |    5.671   106.31 |
 150 |    5.651   105.95 |
1000 |    5.609   105.15 |

print(rf)

Call:
 randomForest(formula = form, data = Data, ntree = 1000, replace = F,      importance = T, do.trace = 50, keep.forest = T, keep.inbag = T,      na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 1

          Mean of squared residuals: 5.608769
                    % Var explained: -5.15
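
Please edit your post and make it a complete reproducible example (starting with the library calls that load the required packages). What does your y variable look like?

@triBaker your edit is close, but not fully reproducible. You cannot refer to a dataset that only you have :)

You're welcome :) By the way, this question is about explained variance on the out-of-bag samples, not about variable importance. Variable importance (RF regression) is the decrease in out-of-bag explained variance caused by permuting a given variable after training and before prediction.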
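
To make the distinction in the last comment concrete, here is a minimal sketch of asking randomForest for permutation-based importance. The synthetic data (only `X1` and `X2` drive `y`) and the settings `ntree = 200` and `scale = FALSE` are assumptions chosen for illustration, not part of the original question. For regression the package reports this measure as `%IncMSE`, the increase in out-of-bag MSE after permuting a variable, which moves in the opposite direction to explained variance.

```r
library(randomForest)

set.seed(1)
# Five candidate predictors; only X1 and X2 actually influence y (assumption).
X <- data.frame(replicate(5, rnorm(500)))
y <- X$X1 + X$X2 + rnorm(500, sd = 0.1)

# importance = TRUE is required for the permutation-based measure.
rf_imp <- randomForest(X, y, ntree = 200, importance = TRUE)

# type = 1 selects the permutation measure; scale = FALSE returns the raw
# mean increase in OOB MSE rather than the value divided by its standard error.
imp <- importance(rf_imp, type = 1, scale = FALSE)
print(imp)
```

The informative predictors should show a clearly larger increase in OOB error than the pure-noise columns, whereas `% Var explained` summarizes the fit of the whole model rather than any single variable.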