High OOB error rate for random forest in R


I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset:

$ BookingId          : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality        : int  1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone                : int  0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource         : int  2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer          : int  1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone            : int  24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day    : int  14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month  : int  6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour   : int  14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int  6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int  1 7 2 4 4 4 6 6 6 6 ...
$ Season                : int  1 1 1 1 1 1 1 1 1 1 ...
$ Pax                   : int  1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime           : int  45 10 25 5 15 25 40 15 40 30 ...

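I used sample to split the dataset into 80%/20% training/test subsets, then ran a random forest that excludes the BookingId factor. The test subset is only used to validate the predictions.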

set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))

train <- data[index,]
test <- data[-index,]

library(randomForest)

extractFeatures <- function(data) {
  features <- c(    "PickupLocality",
        "BookingSource",
        "StarCustomer",
        "ScheduledStart_Month",
        "ScheduledStart_Day",
        "ScheduledStart_WeekDay",
        "ScheduledStart_Hour",
        "Season",
        "Pax")
  fea <- data[,features]
  return(fea)
}

rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)

Tweaking the random forest hyperparameters will almost certainly not improve your performance significantly.
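I would suggest using a regression approach for your data. Because waiting time is not categorical, a classification approach may not work very well: your classification model loses the ordering information that 5 < 10 < 15, and so on.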
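One thing to try first is a simple linear regression. Bin the predicted values for the test set and recalculate the accuracy. Better? Worse? If it is better, go ahead and try a random forest regression model (or, as I would prefer, a gradient boosting machine).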
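
The regress-then-bin check could look roughly like the sketch below. It is only an illustration: the `toy` data frame and the relationship used to generate it are invented, and the bin width of 5 is an assumption based on the WaitingTime values shown in the question. On the real data, `train` and `test` from the question would take the place of the toy split.

```r
# Synthetic stand-in for the booking data; every value below is made up.
set.seed(1)
n <- 1000
toy <- data.frame(
  ScheduledStart_Hour = sample(0:23, n, replace = TRUE),
  Pax                 = sample(1:4,  n, replace = TRUE)
)
# Assumed relationship, for illustration only.
toy$WaitingTime <- 5 * round((10 + 0.8 * toy$ScheduledStart_Hour +
                              2 * toy$Pax + rnorm(n, sd = 5)) / 5)

# 80%/20% split, as in the question.
index <- sample(seq_len(n), round(0.8 * n))
train <- toy[index, ]
test  <- toy[-index, ]

# Fit a plain linear regression on the numeric outcome.
fit  <- lm(WaitingTime ~ ScheduledStart_Hour + Pax, data = train)
pred <- predict(fit, newdata = test)

# Bin predictions to the nearest multiple of 5 (assumed class spacing).
pred_binned <- 5 * round(pred / 5)

# Accuracy comparable to the classifier's, plus the regression error.
accuracy <- mean(pred_binned == test$WaitingTime)
mse      <- mean((pred - test$WaitingTime)^2)
cat("binned accuracy:", accuracy, " MSE:", mse, "\n")
```

Comparing `accuracy` here with the classifier's accuracy on the same test rows shows whether keeping the ordering of the waiting times helps.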
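Second, your data may simply not be predictive of the variable you are interested in. Maybe the data got messed up somewhere upstream. Computing the correlation and/or mutual information of the predictors with the outcome first could be a good diagnostic.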
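
A minimal sketch of the correlation diagnostic, using only base R. The `toy` data frame is invented, and only one column is given any signal so that the output is easy to read. Spearman rank correlation is one possible choice here, not something the answer prescribes. Mutual information is not shown because base R has no function for it.

```r
# Invented data: three predictors, only the hour influences the outcome.
set.seed(1)
n <- 500
toy <- data.frame(
  ScheduledStart_Hour = sample(0:23, n, replace = TRUE),
  Pax                 = sample(1:4,  n, replace = TRUE),
  Season              = sample(1:4,  n, replace = TRUE)
)
# Assumption: only the hour carries signal in this toy data.
toy$WaitingTime <- 10 + toy$ScheduledStart_Hour + rnorm(n, sd = 3)

predictors <- setdiff(names(toy), "WaitingTime")

# Spearman rank correlation of each predictor with the outcome.
cors <- sapply(toy[predictors], function(x)
  cor(x, toy$WaitingTime, method = "spearman"))

# Strongest absolute correlations first.
print(round(sort(abs(cors), decreasing = TRUE), 3))
```

Columns such as PickupLocality or BookingSource look like category codes stored as integers, so a rank correlation may understate whatever signal they carry. Treat near-zero values for those columns with caution.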
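Also, with so many class labels, 23% might actually not be that bad. The probability of correctly labeling a particular data point by random guessing is N_class/N, so the accuracy of a random-guess model is not 50%. You can show that it is better than random guessing by computing the adjusted Rand index.

Thank you for your answer. I will proceed as you suggested and report back.

Hi, I ran a simple regression on my dataset and got an MSE of 145.1712. I also checked the correlations and found no correlation between the variables. I still need to compute the adjusted Rand index, though I would like to try other algorithms; perhaps one of them will return better predictions.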
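
To put the 23% figure from the answer in context, the chance-level accuracy can be estimated from the class frequencies alone. The sketch below uses made-up labels and made-up class proportions. On the real data, `test$WaitingTime` would replace `labels`.

```r
# Made-up class labels standing in for the observed waiting times.
set.seed(1)
classes <- seq(5, 40, by = 5)
labels  <- sample(classes, 2000, replace = TRUE,
                  prob = c(0.05, 0.10, 0.20, 0.25, 0.15, 0.10, 0.10, 0.05))

p <- prop.table(table(labels))   # observed class frequencies

uniform_guess   <- 1 / length(p) # guess every class with equal probability
frequency_guess <- sum(p^2)      # guess in proportion to class frequency
majority_guess  <- max(p)        # always predict the most common class

print(round(c(uniform = uniform_guess, frequency = frequency_guess,
              majority = majority_guess), 3))
```

A classifier is only informative if its accuracy clearly exceeds the largest of these baselines, which is usually the majority-class one.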
tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))

data <- tempdata[rndid<=27780,]