R caret: column order of the data affects the results


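It seems that training on the same data with the columns in a different order changes the results.

A minimal, reproducible example: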

library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1

Then I tried:

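finalVars <- colnames(training)
# reorder columns alphabetically; the data themselves are unchanged
finalVars <- sort(finalVars)

set.seed(825)
# same seed and settings as the first fit, stored under a new name
# so that the first fit is not overwritten
gbmFit2 <- train(Class ~ ., data = training[, finalVars],
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit2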

As the bold numbers in the output below show, a different column order gives different results.

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
gbm_2.1.1
caret_6.0-68
mlbench_2.1-1

The same thing happens with several other models I checked: rpart and C5.0. Does anyone know why this happens?

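This is not caused by caret but by the "gbm" algorithm itself. In "gbm", reordering the columns has roughly the same effect as changing the seed.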
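If column order acts like a second seed, one way to keep runs comparable is to pin the column order before training. The following is a minimal base-R sketch under that assumption; `canonical_order` is a hypothetical helper written for illustration, not a function from caret or gbm, and the toy data frame only stands in for `training`.

```r
# Hypothetical helper (not part of caret or gbm): put the predictors in
# alphabetical order and the outcome column last, so that the same data
# always reaches the model with the same column layout.
canonical_order <- function(df, outcome) {
  stopifnot(outcome %in% names(df))
  predictors <- sort(setdiff(names(df), outcome))
  df[, c(predictors, outcome), drop = FALSE]
}

# Toy data standing in for `training`: same rows, two column orders.
a <- data.frame(V2 = c(0.3, 0.1),
                V1 = c(1.5, 2.5),
                Class = factor(c("M", "R")))
b <- a[, c("Class", "V1", "V2")]

# After canonicalisation both versions have the same column layout.
a_fixed <- canonical_order(a, "Class")
b_fixed <- canonical_order(b, "Class")
same <- identical(a_fixed, b_fixed)
```

With the example from the question, `canonical_order(training, "Class")` could be passed as the `data` argument of `train()`. This only makes the order consistent between runs; it does not make one column order more correct than another.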

Stochastic Gradient Boosting 

157 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa    
  1                   50      0.7609191  0.5163703
  1                  100      0.7934216  0.5817734
  1                  150      0.7977230  0.5897796
  2                   50      0.7858235  0.5669550
  2                  100      **0.8194779**  **0.6331626**
  2                  150      **0.8207279**  **0.6354601**
  3                   50      **0.7946936**  **0.5831441**
  3                  100      0.8130564  0.6195719
  3                  150      0.8220931  0.6381234

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10.
