R caret: column order of the data affects the results
Using the same data with a different column order seems to change the results. A minimal, reproducible example:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
Then I tried:
finalVars <- colnames(training)
# reorder columns
finalVars <- finalVars[order(finalVars)]
set.seed(825)
gbmFit1 <- train(Class ~ ., data = training[, finalVars],
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
As the numbers in bold show, a different column order gives different results.
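To check the difference programmatically rather than by eye, one option is to keep the two fits under separate names and compare their resampling tables. The sketch below assumes the fits were stored as `fitOriginal` and `fitSorted` (hypothetical names; the code above reuses `gbmFit1` for both) and relies on the `results` data frame that a caret `train` object carries. The helper `compare_accuracy` is illustrative and not part of caret, and the numbers in the toy tables are made up.

```r
# Hypothetical helper: largest absolute gap in resampled Accuracy
# between two tuning-result tables that share the same tuning grid.
compare_accuracy <- function(res_a, res_b,
                             by = c("interaction.depth", "n.trees")) {
  # Join the two tables on the tuning parameters; the remaining
  # shared columns get the suffixes ".a" and ".b".
  merged <- merge(res_a, res_b, by = by, suffixes = c(".a", ".b"))
  max(abs(merged$Accuracy.a - merged$Accuracy.b))
}

# Toy tables standing in for fitOriginal$results and fitSorted$results
# (values are invented for illustration only).
toy_a <- data.frame(interaction.depth = c(1, 1), n.trees = c(50, 100),
                    Accuracy = c(0.76, 0.79))
toy_b <- data.frame(interaction.depth = c(1, 1), n.trees = c(50, 100),
                    Accuracy = c(0.76, 0.80))
gap <- compare_accuracy(toy_a, toy_b)

# With real fits (assumed names):
# compare_accuracy(fitOriginal$results, fitSorted$results)
```

A gap of zero would mean the two column orders produced identical resampled accuracy for every tuning combination.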
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
gbm_2.1.1
caret_6.0-68
mlbench_2.1-1
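The same problem occurs with several other models I checked: rpart and C5.0. Does anyone know why this happens?
The cause is not caret but the "gbm" algorithm itself. In "gbm", reordering the columns has roughly the same effect as changing the seed.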
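If the aim is simply to get comparable results from run to run, one possible workaround is to fix a canonical column order before calling `train`. The sketch below is a minimal illustration in base R; `canonical_columns` is a hypothetical helper name, and the toy data frame uses made-up values. It does not remove the dependence on column order, it only pins the order down.

```r
# Hypothetical helper: put the outcome first and sort the predictors
# by name, so the same data always reaches the model in the same order.
canonical_columns <- function(df, outcome) {
  stopifnot(outcome %in% colnames(df))
  predictors <- sort(setdiff(colnames(df), outcome))
  df[, c(outcome, predictors), drop = FALSE]
}

# Toy data frame with shuffled columns (values invented for illustration)
shuffled <- data.frame(V2 = 1:3, Class = c("M", "R", "M"), V1 = 4:6)
ordered <- canonical_columns(shuffled, "Class")
colnames(ordered)

# With the data from the question (assumed usage):
# training <- canonical_columns(training, "Class")
```

Note that `sort` on character vectors follows the current locale, so the resulting order is stable on one machine but may differ between machines with different locale settings.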
Stochastic Gradient Boosting
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ...
Resampling results across tuning parameters:
  interaction.depth  n.trees  Accuracy       Kappa
  1                   50      0.7609191      0.5163703
  1                  100      0.7934216      0.5817734
  1                  150      0.7977230      0.5897796
  2                   50      0.7858235      0.5669550
  2                  100      **0.8194779**  **0.6331626**
  2                  150      **0.8207279**  **0.6354601**
  3                   50      **0.7946936**  **0.5831441**
  3                  100      0.8130564      0.6195719
  3                  150      0.8220931      0.6381234
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
3, shrinkage = 0.1 and n.minobsinnode = 10.