Confusion matrix for logistic regression in R


I am trying to run a logistic regression on the dataset below, using 5-fold cross-validation.

My goal is to predict the Classification column of the dataset, which takes the value 1 (no cancer) or 2 (cancer).

Here is the complete code:

     library(ISLR)
     library(boot)
     dataCancer <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")

     #Randomly shuffle the data
     dataCancer<-dataCancer[sample(nrow(dataCancer)),]
     #Create 5 equally sized folds
     folds <- cut(seq(1,nrow(dataCancer)),breaks=5,labels=FALSE)
     #Perform 5 fold cross validation
     for(i in 1:5){
           #Segment your data by fold using the which() function
           testIndexes <- which(folds == i)
           testData <- dataCancer[testIndexes, ]
           trainData <- dataCancer[-testIndexes, ]
           #Use the test and train data partitions however you desire...

           classification_model = glm(as.factor(Classification) ~ ., data = trainData,family = binomial)
           summary(classification_model)

           #Use the fitted model to do predictions for the test data
           model_pred_probs = predict(classification_model , testData , type = "response")
           model_predict_classification = rep(0 , length(testData))
           model_predict_classification[model_pred_probs > 0.5] = 1

           #Create the confusion matrix and compute the misclassification rate
           table(model_predict_classification , testData)
           mean(model_predict_classification != testData)
     }
I get the following error:

 Error in table(model_predict_classification, testData) : all arguments must have the same length
I don't really understand how to use the confusion matrix here.

I would like to end up with 5 misclassification rates, one per fold. trainData and testData are cut into 5 folds, and the predicted classes should have the same length as the true classes they are compared against; a toy sketch of the comparison I have in mind is below.
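To illustrate what I mean, here is a self-contained toy sketch of the per-fold comparison I am after (made-up vectors, not the real data; the two arguments must be equal-length vectors):

    # predicted and true classes as equal-length vectors
    pred  <- c(1, 2, 2, 1, 2)
    truth <- c(1, 2, 1, 1, 2)
    table(pred, truth)   # 2x2 confusion matrix
    mean(pred != truth)  # misclassification rate, here 0.2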


Thanks for your help.

Here is a solution that uses the caret package to run 5-fold cross-validation on the cancer data after splitting it into test and train datasets. Confusion matrices are generated from both the test and the train data. caret::train() reports the average accuracy across the 5 hold-out folds; the results for each individual fold can be extracted from the fitted model object (see the sketch after the output below).

library(caret)
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")
# set classification as factor, and recode to 
# 0 = no cancer, 1 = cancer 
data$Classification <- as.factor((data$Classification - 1))
# split data into training and test, based on values of dependent variable 
trainIndex <- createDataPartition(data$Classification, p = .75,list=FALSE)
training <- data[trainIndex,]
testing <- data[-trainIndex,]
trCntl <- trainControl(method = "CV",number = 5)
glmModel <- train(Classification ~ .,data = training,trControl = trCntl,method="glm",family = "binomial")
# print the model info
summary(glmModel)
glmModel
confusionMatrix(glmModel)
# generate predictions on hold back data
trainPredicted <- predict(glmModel,testing)
# generate confusion matrix for hold back data
confusionMatrix(trainPredicted,reference=testing$Classification)
The output:

> # print the model info
> > summary(glmModel)
> 
> Call: NULL
> 
> Deviance Residuals: 
>     Min       1Q   Median       3Q      Max  
> -2.1542  -0.8358   0.2605   0.8260   2.1009  
> 
> Coefficients:
>               Estimate Std. Error z value Pr(>|z|)  
> (Intercept) -4.4039248  3.9159157  -1.125   0.2607  
> Age         -0.0190241  0.0177119  -1.074   0.2828  
> BMI         -0.1257962  0.0749341  -1.679   0.0932 .
> Glucose      0.0912229  0.0389587   2.342   0.0192 *
> Insulin      0.0917095  0.2889870   0.317   0.7510  
> HOMA        -0.1820392  1.2139114  -0.150   0.8808  
> Leptin      -0.0207606  0.0195192  -1.064   0.2875  
> Adiponectin -0.0158448  0.0401506  -0.395   0.6931  
> Resistin     0.0419178  0.0255536   1.640   0.1009  
> MCP.1        0.0004672  0.0009093   0.514   0.6074  
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> (Dispersion parameter for binomial family taken to be 1)
> 
>     Null deviance: 119.675  on 86  degrees of freedom
> Residual deviance:  89.804  on 77  degrees of freedom
> AIC: 109.8
> 
> Number of Fisher Scoring iterations: 7
> 
> > glmModel
> Generalized Linear Model 
> 
> 87 samples
>  9 predictor
>  2 classes: '0', '1' 
> 
> No pre-processing
> Resampling: Cross-Validated (5 fold) 
> Summary of sample sizes: 70, 69, 70, 69, 70 
> Resampling results:
> 
>   Accuracy   Kappa    
>   0.7143791  0.4356231
> 
> > confusionMatrix(glmModel)
> Cross-Validated (5 fold) Confusion Matrix 
> 
> (entries are percentual average cell counts across resamples)
> 
>           Reference
> Prediction    0    1
>          0 33.3 17.2
>          1 11.5 37.9
> 
>  Accuracy (average) : 0.7126
> 
> > # generate predictions on hold back data
> > trainPredicted <- predict(glmModel,testing)
> > # generate confusion matrix for hold back data
> > confusionMatrix(trainPredicted,reference=testing$Classification)
> Confusion Matrix and Statistics
> 
>           Reference
> Prediction  0  1
>          0 11  2
>          1  2 14
> 
>                Accuracy : 0.8621          
>                  95% CI : (0.6834, 0.9611)
>     No Information Rate : 0.5517          
>     P-Value [Acc > NIR] : 0.0004078       
> 
>                   Kappa : 0.7212          
>  Mcnemar's Test P-Value : 1.0000000       
> 
>             Sensitivity : 0.8462          
>             Specificity : 0.8750          
>          Pos Pred Value : 0.8462          
>          Neg Pred Value : 0.8750          
>              Prevalence : 0.4483          
>          Detection Rate : 0.3793          
>    Detection Prevalence : 0.4483          
>       Balanced Accuracy : 0.8606          
> 
>        'Positive' Class : 0
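As mentioned above, the per-fold results can be pulled out of the fitted train object. A minimal sketch, assuming the glmModel object created by the code above (resample is the data frame in which caret stores the hold-out metrics):

    # accuracy and kappa for each of the 5 folds
    glmModel$resample
    # per-fold misclassification rates, as asked for in the question
    1 - glmModel$resample$Accuracy
    # mean accuracy across folds (matches the value reported by glmModel)
    mean(glmModel$resample$Accuracy)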

Comments on this answer:

- Thanks for your valuable help. If I understand correctly, confusionMatrix(glmModel) gives the accuracy on the current dataset, while confusionMatrix(trainPredicted, reference=testing$Classification) gives the accuracy on the training data. Am I wrong?
- @Ilan The first confusion matrix uses the training data; the second uses the test data, which was not used to create the model. Validating the model against data that was not used to build it helps us see whether the model is overfitting the training data.
- If I want to reuse this code for classification with a regression tree, do I just need to replace glmModel with treeModel?
- Yes, but the tree diagram will be easier to read if you follow it with plot(treeModel$finalModel, uniform=TRUE, margin=0.3) and text(treeModel$finalModel, use.n=TRUE, all=TRUE, cex=0.9) — see the sketch below.
- Thank you. If I use the caret library, can I use model selection to pick a relevant subset of features, and plot the error rate against flexibility (to choose the best level of flexibility)?
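For reference, a hedged sketch of the tree variant discussed in the comments above, assuming the training and trCntl objects from the answer (method = "rpart" is caret's interface to classification trees; tuneLength = 10 is an illustrative choice, not part of the original thread):

    library(caret)
    library(rpart)
    # same data and resampling scheme as the answer, different learner
    treeModel <- train(Classification ~ ., data = training,
                       trControl = trCntl, method = "rpart", tuneLength = 10)
    # cross-validated accuracy against the complexity parameter (flexibility)
    plot(treeModel)
    # draw the final tree, as suggested in the comments
    plot(treeModel$finalModel, uniform = TRUE, margin = 0.3)
    text(treeModel$finalModel, use.n = TRUE, all = TRUE, cex = 0.9)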