R models giving 100% accuracy: random forest, logit, C5.0?
When I try to fit models to predict the outcome "death", I get 100% accuracy, which is obviously wrong. Can someone tell me what I am missing?
library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)
#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
# 0, 0, 0, 0,
#C5.0 [Release 2.07 GPL Edition] Tue Aug 4 10:23:10 2015
#-------------------------------
#Class specified by attribute `outcome'
#Read 27875 cases (23 attributes) from undefined.data
#21 attributes winnowed
#Estimated importance of remaining attributes:
#-2147483648% no.subjective.fevernofever
#Rules:
#Rule 1: (26982, lift 1.0)
# no.subjective.fevernofever <= 0
# -> class no [1.000]
#Rule 2: (893, lift 31.2)
# no.subjective.fevernofever > 0
# -> class yes [0.999]
#Default class: no
#Evaluation on training data (27875 cases):
# Rules
# ----------------
# No Errors
# 2 0( 0.0%) <<
# (a) (b) <-classified as
# ---- ----
# 26982 (a): class no
# 893 (b): class yes
# Attribute usage:
# 100.00% no.subjective.fevernofever
#Time: 0.1 secs
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
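Edit: following the comments, I realized that the no.subjective.fever variable has exactly the same values as the target variable death, so I excluded it from the model. Then I got even stranger results:
Random forest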
set.seed(100)
nmodelRF<- train(death~.-no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF<-predict(nmodelRF,testing_Score)
confusionMatrix(npredictRF, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17988 595
# yes 0 0
#
# Accuracy : 0.968
# 95% CI : (0.9653, 0.9705)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5109
#
# Kappa : 0
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 1.000
# Specificity : 0.000
# Pos Pred Value : 0.968
# Neg Pred Value : NaN
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 1.000
# Balanced Accuracy : 0.500
#
# 'Positive' Class : no
Logit
set.seed(100)
nmodelLOGIT<- train(death~.-no.subjective.fever, data=training_Score,method="glm",family="binomial", trControl=control)
summary(nmodelLOGIT)
# Call:
# NULL
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5113 -0.2525 -0.2041 -0.1676 3.1698
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.432065 1.084942 2.242 0.024984 *
#age.in.months -0.001047 0.001293 -0.810 0.417874
#temp -0.168704 0.028815 -5.855 4.78e-09 ***
#genderfemale -0.053306 0.070468 -0.756 0.449375
#palloryes 0.282123 0.076518 3.687 0.000227 ***
#jaundiceyes 0.323755 0.144607 2.239 0.025165 *
#vomitingyes -0.533661 0.082948 -6.434 1.25e-10 ***
#diarrheayes -0.040272 0.080417 -0.501 0.616520
#dark.urineyes -0.583666 0.168787 -3.458 0.000544 ***
#intercostal.retractionyes -0.021717 0.129607 -0.168 0.866926
#subcostal.retractionyes 0.269588 0.128772 2.094 0.036301 *
#wheezingyes -0.587940 0.150475 -3.907 9.34e-05 ***
#rhonchiyes -0.008565 0.140095 -0.061 0.951249
#difficulty.breathingyes 0.397394 0.087789 4.527 5.99e-06 ***
#deep.breathingyes 0.399302 0.098761 4.043 5.28e-05 ***
#convulsionsyes 0.132609 0.094038 1.410 0.158491
#lethargyyes 0.338599 0.089934 3.765 0.000167 ***
#unable.to.sityes 0.452111 0.104556 4.324 1.53e-05 ***
#unable.to.drinkyes 0.516878 0.089685 5.763 8.25e-09 ***
#altered.consciousnessyes 0.433672 0.123288 3.518 0.000436 ***
#unconsciousnessyes 0.754012 0.136105 5.540 3.03e-08 ***
#meningeal.signsyes 0.188823 0.161088 1.172 0.241130
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7902.5 on 27874 degrees of freedom
# Residual deviance: 7148.5 on 27853 degrees of freedom
# AIC: 7192.5
#
# Number of Fisher Scoring iterations: 6
npredictLOGIT<-predict(nmodelLOGIT,testing_Score)
confusionMatrix(npredictLOGIT, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17982 592
# yes 6 3
#
# Accuracy : 0.9678
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5605
#
# Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 0.999666
# Specificity : 0.005042
# Pos Pred Value : 0.968127
# Neg Pred Value : 0.333333
# Prevalence : 0.967981
# Detection Rate : 0.967659
# Detection Prevalence : 0.999516
# Balanced Accuracy : 0.502354
#
# 'Positive' Class : no
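The 100% accuracy result is probably not correct. I assume it comes from the target variable (or another variable with essentially the same entries as the target, as @ulfelder pointed out in the comments) being included in both the training set and the test set. Normally such columns have to be removed during model building and testing, because they represent the target that describes the classification, whereas the training/test data should contain only information that (hopefully) leads to the correct classification according to the target variable.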
You could try the following:
# Assumes library(caret) is loaded and `control` is defined as in the question
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death, p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
# Keep the target separately, then drop it from the predictor tables
train_target <- training_Score$death
test_target <- testing_Score$death
training_Score <- training_Score[,-which(colnames(training_Score)=="death")]
testing_Score <- testing_Score[,-which(colnames(testing_Score)=="death")]
modelRF <- train(training_Score, train_target, method="rf", trControl=control)
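Before removing columns by hand, it may help to check which predictors duplicate the target. The following is a minimal sketch in base R, not something from the thread: the data frame `toy`, its column `leaky`, and the helper `is_perfect_proxy` are hypothetical stand-ins for `riskFinal` and `no.subjective.fever`. It flags a factor predictor when every one of its levels occurs with only one class of the target.

```r
# Hypothetical toy data standing in for riskFinal (values are made up)
toy <- data.frame(
  death  = factor(rep(c("no", "yes"), times = c(180, 20))),
  pallor = factor(rep(c("no", "yes"), times = 100))
)
# A predictor that mirrors the target, like no.subjective.fever in the question
toy$leaky <- factor(ifelse(toy$death == "yes", "nofever", "fever"))

# TRUE when each level of x appears together with a single class of y.
# Intended for factor predictors; a continuous column with mostly unique
# values would be flagged trivially, so it needs a separate check.
is_perfect_proxy <- function(x, y) {
  tab <- table(x, y)
  all(rowSums(tab > 0) <= 1)
}

predictors <- setdiff(names(toy), "death")
proxy <- sapply(toy[predictors], is_perfect_proxy, y = toy$death)
print(proxy)
```

Any column reported as `TRUE` deserves a look at how it was recorded before it is used as a predictor.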
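Do you know how many data points are in the training data, the cross-validation data, and the test data? Are these three sets independent?

The 100% accuracy of the C5.0 model almost certainly means that you have mistakenly included the target variable in the model's training/test data. The code then discards all other variables because it recognizes that a perfect classification can be obtained just by looking at the target variable.

Or, if it is not the target variable itself, perhaps one or more predictors change systematically when death is observed (for example, temperature becomes 0), so the target leaks into the predictors. You could try lagging the predictors by one period (or leading the target) and see whether you get more realistic results.

I checked the predictors and found that one variable had exactly the same values as the outcome, so I took it out of the model, but now I get almost 100% accuracy with almost 0% specificity and almost 100% sensitivity. What is wrong?

I hope I am not adding confusion, but I have a question about the edit: if you remove no.subjective.fever from the training set, shouldn't that variable also be dropped from the testing_Score set used for prediction?

The caret package excludes the target variable from the model when using the train() function. To be sure, I did as you suggested and got exactly the same results as in my last edit.

I should point out that I am not familiar with the caret package. My comments referred to packages I have been using, such as C50. If my analysis is not correct, I hope you find the cause. Following your comment, I rephrased the text more cautiously. Even if this answer does not solve your specific problem, I would still like to leave it posted because I think it may be useful to others. I once made exactly this mistake, and it took me some time to figure out what had gone wrong.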
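One possible reading of the "almost 100% accuracy, 0% specificity" result, offered here as an interpretation and not something stated in the thread: only 595 of the 18583 test cases are deaths, so a model that predicts "no" for everyone already reaches the no-information rate. The sketch below recomputes the statistics in base R from the random forest confusion matrix reported in the question; the variable names are illustrative.

```r
# Random forest confusion matrix, copied from the output in the question
cm <- matrix(
  c(17988, 0,    # Reference = no:  predicted no, predicted yes
    595,   0),   # Reference = yes: predicted no, predicted yes
  nrow = 2,
  dimnames = list(Prediction = c("no", "yes"), Reference = c("no", "yes"))
)

accuracy    <- sum(diag(cm)) / sum(cm)
nir         <- max(colSums(cm)) / sum(cm)           # always predict the majority class
sensitivity <- cm["no", "no"] / sum(cm[, "no"])     # positive class is "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 3)
```

Accuracy equals the no-information rate here, which is why Kappa is 0 and balanced accuracy is 0.5: with this class imbalance, accuracy alone says little about how well deaths are detected.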
str(riskFinal)
#'data.frame': 46458 obs. of 23 variables:
# $ age.in.months : num 3 3 4 2 1.16 ...
# $ temp : num 35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...