R gives models with 100% accuracy: random forest, logit, C5.0?


When I try to fit models to predict the outcome "death", I get 100% accuracy, which is obviously wrong. Can anyone tell me what I am missing?

library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]

control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)

#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 


#C5.0 [Release 2.07 GPL Edition]    Tue Aug  4 10:23:10 2015
#-------------------------------

#Class specified by attribute `outcome'

#Read 27875 cases (23 attributes) from undefined.data

#21 attributes winnowed
#Estimated importance of remaining attributes:

#-2147483648%  no.subjective.fevernofever

#Rules:

#Rule 1: (26982, lift 1.0)
#   no.subjective.fevernofever <= 0
#   ->  class no  [1.000]

#Rule 2: (893, lift 31.2)
#   no.subjective.fevernofever > 0
#   ->  class yes  [0.999]

#Default class: no


#Evaluation on training data (27875 cases):

#           Rules     
#     ----------------
#       No      Errors

#        2    0( 0.0%)   <<


#      (a)   (b)    <-classified as
#     ----  ----
#    26982          (a): class no
#            893    (b): class yes


#   Attribute usage:

#   100.00% no.subjective.fevernofever


#Time: 0.1 secs


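# The prediction step was not shown in the original post; presumably it mirrors the later models:
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)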

#Confusion Matrix and Statistics

#          Reference
#Prediction    no   yes
#       no  17988     0
#       yes     0   595

#               Accuracy : 1          
#                 95% CI : (0.9998, 1)
#    No Information Rate : 0.968      
#    P-Value [Acc > NIR] : < 2.2e-16  

#                  Kappa : 1          
# Mcnemar's Test P-Value : NA         

#            Sensitivity : 1.000      
#            Specificity : 1.000      
#         Pos Pred Value : 1.000      
#         Neg Pred Value : 1.000      
#             Prevalence : 0.968      
#         Detection Rate : 0.968      
#   Detection Prevalence : 0.968      
#      Balanced Accuracy : 1.000      

#       'Positive' Class : no

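Edit: Following the comments, I realized that the no.subjective.fever variable has exactly the same values as the target variable death, so I excluded it from the model. Then I got even stranger results: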

Random forest

set.seed(100)
nmodelRF <- train(death ~ . - no.subjective.fever, data = training_Score, method = "rf", trControl = control)
summary(nmodelRF)
npredictRF <- predict(nmodelRF, testing_Score)
confusionMatrix(npredictRF, testing_Score$death)


 # Confusion Matrix and Statistics
   # 
   #           Reference
   # Prediction    no   yes
   #        no  17988   595
   #        yes     0     0
   #                                           
   #               Accuracy : 0.968           
   #                  95% CI : (0.9653, 0.9705)
   #     No Information Rate : 0.968           
   #     P-Value [Acc > NIR] : 0.5109          
   #                                           
   #                   Kappa : 0               
   #  Mcnemar's Test P-Value : <2e-16          
   #                                           
   #             Sensitivity : 1.000           
   #             Specificity : 0.000           
   #          Pos Pred Value : 0.968           
   #          Neg Pred Value :   NaN           
   #              Prevalence : 0.968           
   #          Detection Rate : 0.968           
   #    Detection Prevalence : 1.000           
   #       Balanced Accuracy : 0.500           
   #                                           
   #        'Positive' Class : no 


Logit

set.seed(100)
nmodelLOGIT <- train(death ~ . - no.subjective.fever, data = training_Score, method = "glm", family = "binomial", trControl = control)
summary(nmodelLOGIT)

# Call:
#         NULL
# 
# Deviance Residuals: 
#         Min       1Q   Median       3Q      Max  
# -1.5113  -0.2525  -0.2041  -0.1676   3.1698  
# 
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)    
# (Intercept)                2.432065   1.084942   2.242 0.024984 *  
#age.in.months             -0.001047   0.001293  -0.810 0.417874    
#temp                      -0.168704   0.028815  -5.855 4.78e-09 ***
#genderfemale              -0.053306   0.070468  -0.756 0.449375    
#palloryes                  0.282123   0.076518   3.687 0.000227 ***
#jaundiceyes                0.323755   0.144607   2.239 0.025165 *  
#vomitingyes               -0.533661   0.082948  -6.434 1.25e-10 ***
#diarrheayes               -0.040272   0.080417  -0.501 0.616520    
#dark.urineyes             -0.583666   0.168787  -3.458 0.000544 ***
#intercostal.retractionyes -0.021717   0.129607  -0.168 0.866926    
#subcostal.retractionyes    0.269588   0.128772   2.094 0.036301 *  
#wheezingyes               -0.587940   0.150475  -3.907 9.34e-05 ***
#rhonchiyes                -0.008565   0.140095  -0.061 0.951249    
#difficulty.breathingyes    0.397394   0.087789   4.527 5.99e-06 ***
#deep.breathingyes          0.399302   0.098761   4.043 5.28e-05 ***
#convulsionsyes             0.132609   0.094038   1.410 0.158491    
#lethargyyes                0.338599   0.089934   3.765 0.000167 ***
#unable.to.sityes           0.452111   0.104556   4.324 1.53e-05 ***
#unable.to.drinkyes         0.516878   0.089685   5.763 8.25e-09 ***
#altered.consciousnessyes   0.433672   0.123288   3.518 0.000436 ***
#unconsciousnessyes         0.754012   0.136105   5.540 3.03e-08 ***
#meningeal.signsyes         0.188823   0.161088   1.172 0.241130    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
# Null deviance: 7902.5  on 27874  degrees of freedom
# Residual deviance: 7148.5  on 27853  degrees of freedom
# AIC: 7192.5
# 
# Number of Fisher Scoring iterations: 6

npredictLOGIT <- predict(nmodelLOGIT, testing_Score)
confusionMatrix(npredictLOGIT, testing_Score$death)

# Confusion Matrix and Statistics
# 
# Reference
# Prediction    no   yes
# no  17982   592
# yes     6     3
# 
# Accuracy : 0.9678          
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968           
# P-Value [Acc > NIR] : 0.5605          
# 
# Kappa : 0.009           
# Mcnemar's Test P-Value : <2e-16          
# 
# Sensitivity : 0.999666        
# Specificity : 0.005042        
# Pos Pred Value : 0.968127        
# Neg Pred Value : 0.333333        
# Prevalence : 0.967981        
# Detection Rate : 0.967659        
# Detection Prevalence : 0.999516        
# Balanced Accuracy : 0.502354        
# 
# 'Positive' Class : no
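
The figures in the random-forest matrix above can be reproduced by hand, which makes it easier to see why the accuracy looks high even though no death is ever predicted. The following is only a sketch in base R: the counts are copied from the output in the question, and the variable names (`cm`, `nir`, and so on) are made up for illustration.

```r
# Counts copied from the random-forest confusion matrix in the question.
# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(c(17988, 0, 595, 0), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

# Share of all cases that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# "no" is the positive class in the output, so sensitivity is the share of
# true "no" cases predicted as "no", and specificity the share of true "yes"
# cases predicted as "yes".
sensitivity <- cm["no", "no"] / sum(cm[, "no"])
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])

# No-information rate: the accuracy of always predicting the majority class.
nir <- max(colSums(cm)) / sum(cm)

balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, nir = nir,
              sensitivity = sensitivity, specificity = specificity,
              balanced_accuracy = balanced_accuracy), 3))
```

Here the accuracy is exactly the no-information rate, because every case was assigned to the majority class "no"; with roughly 3% deaths in the data, that alone yields about 96.8% accuracy.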

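The 100% accuracy result is probably not correct. I assume it arises because the target variable (or, as @ulfelder pointed out in the comments, another variable with essentially the same entries as the target) is included in both the training set and the test set. Such columns normally need to be removed during model building and testing: they describe the classification target itself, whereas the training/test data should contain only the information that (hopefully) leads to the correct classification of the target variable.

You could try the following: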
library(caret)
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death, p = 0.6, list = FALSE)
training_Score <- riskFinal[intrain, ]
testing_Score <- riskFinal[-intrain, ]
# Keep the outcome separately for each split
train_target <- training_Score$death
test_target <- testing_Score$death
# Drop the outcome column from both predictor sets
training_Score <- training_Score[, -which(colnames(training_Score) == "death")]
testing_Score <- testing_Score[, -which(colnames(testing_Score) == "death")]
# 'control' is the trainControl object defined in the question
modelRF <- train(training_Score, train_target, method = "rf", trControl = control)
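
Before dropping columns, it may help to confirm which predictor carries the target. Below is a minimal sketch in base R under stated assumptions: `toy` is a small hypothetical stand-in for `riskFinal`, and `is_leaky` is a helper invented here, not part of caret or C50. It flags a factor predictor when each of its levels occurs with only one outcome class.

```r
# Hypothetical toy data standing in for riskFinal; column names mirror the question.
death <- factor(rep(c("no", "no", "no", "yes"), times = 5))
toy <- data.frame(
  temp   = seq(35, 39.75, by = 0.25),
  pallor = factor(rep(c("no", "yes", "yes", "no", "no"), times = 4)),
  gender = factor(rep(c("male", "female", "female"), length.out = 20)),
  # Simulated leak: a relabelled copy of the outcome.
  no.subjective.fever = factor(ifelse(death == "yes", "nofever", "fever")),
  death  = death
)

# Illustrative helper: TRUE when every level of x co-occurs with at most one
# outcome class, i.e. knowing x is enough to know the target.
is_leaky <- function(x, target) {
  tab <- table(x, target)
  all(rowSums(tab > 0) <= 1)
}

# Only factor predictors are checked; a continuous column with many unique
# values would trivially look "pure" and needs a different check.
predictors <- setdiff(names(toy), "death")
factor_predictors <- predictors[vapply(toy[predictors], is.factor, logical(1))]
leaky <- factor_predictors[vapply(toy[factor_predictors], is_leaky,
                                  logical(1), target = toy$death)]
print(leaky)
```

On real data a cross-tabulation such as `table(riskFinal$no.subjective.fever, riskFinal$death)` gives the same information for a single suspect column.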

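How many data points are there in the training data, the cross-validation data, and the test data? Are these three sets independent?

The 100% accuracy of the C5.0 model almost certainly means that you mistakenly included the target variable in the model's training/test data. The code then discards all the other variables because it recognizes that a perfect classification can be obtained just by looking at the target variable. Or, if it is not the target variable itself, perhaps one or more predictors change systematically when death is observed (for example, temperature goes to 0), so the target leaks into the predictors. You could try lagging the predictors by one period (or leading the target) and see whether you get more realistic results.

I checked the predictors and found that one of the variables had exactly the same values as the outcome, so I took it out of the model, but now I get results with almost 100% accuracy, almost 0% specificity, and almost 100% sensitivity. What is wrong?

I hope I am not adding confusion, but I have a question about the edit: if you remove no.subjective.fever from the training set, shouldn't that variable also be dropped from the testing_Score set used for prediction?

The caret package excludes the target variable from the model through the train() function. To be sure, I did as you suggested and got exactly the same results as in my last edit.

I should point out that I am not familiar with the caret package. My comment referred to packages I have been using, such as C50. If my analysis is incorrect, I hope you can find the cause. Following your comment, I rephrased the text more cautiously. Even if this answer does not solve your specific problem, I would still like to leave it posted because I think it may be useful to others. I once made this kind of mistake myself, and it took me some time to figure out what had gone wrong.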
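
The point about the formula interface can be checked without fitting anything. This sketch uses only base R's `terms()`; the data frame `toy` is hypothetical, and I am assuming that caret's formula method expands `death ~ . - no.subjective.fever` the same way base R does, which is consistent with what the asker reports.

```r
# Hypothetical four-row stand-in for the training data.
toy <- data.frame(
  temp   = c(36.5, 39.4, 36.8, 35.2),
  pallor = factor(c("yes", "no", "yes", "no")),
  no.subjective.fever = factor(c("fever", "fever", "nofever", "nofever")),
  death  = factor(c("no", "no", "yes", "yes"))
)

# Same formula shape as in the question.
f <- death ~ . - no.subjective.fever

# Expanding the dot against the data shows which columns end up as predictors:
# the response is never included, and the subtracted column is removed.
used <- attr(terms(f, data = toy), "term.labels")
print(used)
```

Because the excluded column never enters the fitted model, leaving it in the test data frame should not change the predictions, although dropping it from both sets keeps the two consistent.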
str(riskFinal)
#'data.frame':  46458 obs. of  23 variables:
# $ age.in.months         : num  3 3 4 2 1.16 ...
# $ temp                  : num  35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender                : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever   : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor                : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting              : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea              : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine            : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi               : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing        : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions           : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy              : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit         : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness       : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs       : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death                 : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
