How do I generate a confusion matrix with cross-validation in R?

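I am new to R and machine learning, and I am working with two-class data. I am trying to do cross-validation, but when I try to build the model's confusion matrix I get an error saying that all arguments must have the same length. I don't understand why my inputs have different lengths. Any help in the right direction would be greatly appreciated.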

library(MASS)
xCV = x[sample(nrow(x)),]

folds <- cut(seq(1,nrow(xCV)),breaks=10,labels=FALSE)

for(i in 1:10){

  testIndexes = which(folds==i,arr.ind=TRUE)
  testData = xCV[testIndexes, ]
  trainData = xCV[-testIndexes, ]

}
ldamodel = lda(class ~ ., trainData)
lda.predCV = predict(model)

conf.LDA.CV=table(trainData$class, lda.predCV$class)
print(conf.LDA.CV)

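The problem with your code is that the modelling and prediction are not done inside the loop. The loop only leaves you with the testIndexes for i == 10, because each iteration overwrites the indexes from the previous one.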

The following code runs on the iris data:

library(MASS)
data(iris)
Generate the folds:

set.seed(1)
folds <- sample(1:10, size = nrow(iris), replace = TRUE) # 10-fold CV, fold sizes vary
table(folds)
#output
folds
 1  2  3  4  5  6  7  8  9 10 
10 12 17 16 21 13 17 20 12 12
Or, if you want folds of equal size:

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(iris)), size = nrow(iris), replace = FALSE)
table(folds)
#output
folds
 1  2  3  4  5  6  7  8  9 10 
15 15 15 15 15 15 15 15 15 15 
Run the model by fitting it on 9 folds and predicting on the held-out fold:

CV_lda <- lapply(1:10, function(x){ 
  model <- lda(Species ~ ., iris[folds != x, ])
  preds <- predict(model,  iris[folds == x,], type="response")$class
  return(data.frame(preds, real = iris$Species[folds == x]))
})
This produces a list of held-out predictions. To combine them into a data frame:

CV_lda <- do.call(rbind, CV_lda)
Generate the confusion matrix:

library(caret)

confusionMatrix(CV_lda$preds, CV_lda$real)
#output
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49

Overall Statistics

               Accuracy : 0.98            
                 95% CI : (0.9427, 0.9959)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.97            
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9600           0.9800
Specificity                 1.0000            0.9900           0.9800
Pos Pred Value              1.0000            0.9796           0.9608
Neg Pred Value              1.0000            0.9802           0.9899
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3200           0.3267
Detection Prevalence        0.3333            0.3267           0.3400
Balanced Accuracy           1.0000            0.9750           0.9800
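
For readers who prefer to keep the for loop from the question, the same idea can be sketched as below. This is only an illustration on iris, not the asker's data: the names pred and conf.LDA.CV and the Species response are assumptions chosen to mirror the question. The point is that the fit and the prediction both happen inside the loop, and each fold's predictions are stored at the positions of that fold's rows, so the predicted and actual vectors end up with the same length.

```r
library(MASS)  # provides lda()

set.seed(1)
# Shuffle the rows, then cut the row positions into 10 contiguous folds
xCV <- iris[sample(nrow(iris)), ]
folds <- cut(seq_len(nrow(xCV)), breaks = 10, labels = FALSE)

# Placeholder factor that will hold one held-out prediction per row
pred <- factor(rep(NA, nrow(xCV)), levels = levels(xCV$Species))

for (i in 1:10) {
  testIndexes <- which(folds == i)
  trainData <- xCV[-testIndexes, ]
  testData  <- xCV[testIndexes, ]

  # Fit and predict inside the loop so every fold is used once as the test set
  ldamodel <- lda(Species ~ ., data = trainData)
  pred[testIndexes] <- predict(ldamodel, newdata = testData)$class
}

# Both vectors now have nrow(xCV) elements, so table() no longer complains
conf.LDA.CV <- table(actual = xCV$Species, predicted = pred)
print(conf.LDA.CV)
```

The exact counts depend on the random shuffle and on the R version's sampling algorithm, so they may differ slightly from the caret output above.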

Using the seeds dataset from the hglm.data package

Print the confusion matrix and the accuracy:

conf <- table(pred=lda.predCV, actual=seedsCV$extract)
accuracy <- sum(diag(conf))/sum(conf)

> conf
          actual
pred       Bean Cucumber
  Bean       10        0
  Cucumber    0       11


> accuracy
[1] 1
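
The accuracy line above is the share of all counts that lie on the diagonal of the confusion matrix. As a small check of that formula, the sketch below types in the iris confusion matrix printed by caret in the other answer and recomputes its overall accuracy and per-class sensitivity by hand. The object names conf_iris, acc and sens are made up for this illustration.

```r
# Counts copied from the caret output: rows are predictions, columns are the reference
classes <- c("setosa", "versicolor", "virginica")
conf_iris <- matrix(c(50,  0,  0,
                       0, 48,  1,
                       0,  2, 49),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy: correct predictions (the diagonal) over all predictions
acc <- sum(diag(conf_iris)) / sum(conf_iris)
print(acc)   # 147 / 150 = 0.98

# Sensitivity per class: correct predictions over the reference (column) totals
sens <- diag(conf_iris) / colSums(conf_iris)
print(sens)  # setosa 1.00, versicolor 0.96, virginica 0.98
```

Both results agree with the Accuracy and Sensitivity rows that confusionMatrix reports.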
