R caret package (rpart): building a classification tree

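I have spent several days trying to fit a classification tree with the caret package. The problem is my factor variables. I can grow the tree, but when I try to predict on the test sample with the best model, it fails: the train function created dummy variables for my factor variables, and the predict function then cannot find those newly created dummies in the test set. How should I handle this?

My code is as follows: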

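install.packages("caret", dependencies = c("Depends", "Suggests"))
library(caret)

# Read the semicolon-separated file; "?" marks missing values
db <- read.csv("db.csv", header = TRUE, sep = ";", na.strings = "?")
fix(db)  # opens the interactive data editor

# Recode the 0/1 outcome as a No/Yes factor and drop the original column
db$defaillance <- factor(db$defaillance)
db$def <- ifelse(db$defaillance == 0, "No", "Yes")
db$def <- factor(db$def)
db$defaillance <- NULL

# Treat the coded predictors as factors
db$canal <- factor(db$canal)
db$sect_isodev <- factor(db$sect_isodev)
db$sect_risq <- factor(db$sect_risq)

# Delete near-zero-variance predictors (column 78 is left out of the check)
nzv <- nearZeroVar(db[, -78])
# Guard against an empty result: db[, -integer(0)] would drop every column
db_new <- if (length(nzv) > 0) db[, -nzv] else db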

inTrain <- createDataPartition(y = db_new$def, p = .75, list = FALSE)                               
training <- db_new[inTrain,]
testing <- db_new[-inTrain,]
str(training)
str(testing)
dim(training)
dim(testing)
Then I fit the model like this:

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

#CART1
set.seed(1234)
tree1 <- train(def ~ .,
               training,
               method = "rpart",
               tuneLength = 20,
               metric = "ROC",
               trControl = fitControl)
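Here is the output; note the dummy variables (such as sect_isodev1 and sect_risq3) created from the factors: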

RNTB          38.397731
sect_isodev1   6.742289
sect_isodev3   4.005016
sect_isodev8   2.520850
sect_risq3     9.909127
sect_risq4     6.737908
sect_risq5     3.085714
SOLV          73.067539
TRES          47.906884
sect_isodev2   0.000000
sect_isodev4   0.000000
sect_isodev5   0.000000
sect_isodev6   0.000000
sect_isodev7   0.000000
sect_isodev9   0.000000
sect_risq0     0.000000
sect_risq1     0.000000
sect_risq2     0.000000
Here is the error:

model.tree1

As far as I can tell, there are two problems:

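  • R cannot find a suitable predict function for tree1$finalModel. It should be predict.rpart, because tree1$finalModel is of class rpart. I ran into this error too and, unfortunately, I don't know the root cause. This is also why R does not accept type = "class", which predict.rpart would accept.
  • Giving the train function a formula, rather than x and y objects, leads to the later problem of variables such as sect_isodev1 not being found.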
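To illustrate the second point, here is a minimal sketch in base R of what a formula interface does to a factor. The data frame `toy` and its values are made up for illustration; only the column names `sect_risq` and `SOLV` are borrowed from the question. As far as I know, the formula method of train builds its predictors in a similar way, so the fitted tree ends up referring to dummy columns that do not exist in the raw test data.

```r
# Hypothetical toy data: one factor predictor and one numeric predictor
toy <- data.frame(
  y         = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  sect_risq = factor(c(0, 1, 2, 0, 1, 2)),
  SOLV      = c(0.2, 0.5, 0.1, 0.9, 0.4, 0.7)
)

# A formula is expanded into a design matrix: the factor becomes
# one dummy column per level (the first level is the baseline)
mm <- model.matrix(y ~ ., data = toy)
print(colnames(mm))

# The dummy names exist only in the design matrix, not in the data frame
print("sect_risq1" %in% names(toy))
```

A model fitted on the expanded columns will look for `sect_risq1` at prediction time, so the raw data frame has to go through the same expansion first.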
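After reproducing the error with random data (similar to your str output), passing x and y objects to train, and calling predict.rpart explicitly from rpart, this is what I did: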

tree1 = train (y = training$def,
               x = training[, -which(colnames(training) == "def")],
               method = "rpart",
               tuneLength=20,
               metric="ROC",
               trControl = fitControl)
summary(tree1$finalModel)
# This still results in Error: could not find function "predict.rpart":
model.tree1 <- predict.rpart(tree1$finalModel, newdata = testing)
# Explicitly calling predict.rpart from the rpart package works:
rpart:::predict.rpart(object = tree1$finalModel, 
                      newdata = testing, 
                      type = "class") 
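One possible explanation for the "could not find function" error, offered as a sketch rather than a confirmed diagnosis of the case above: in the rpart versions I am aware of, predict.rpart is registered as an S3 method but is not exported from the package namespace, so it cannot be called by name, while the predict generic can still dispatch to it. The example uses the built-in iris data, not the data from the question.

```r
library(rpart)

# Toy classification tree on a built-in dataset
fit <- rpart(Species ~ ., data = iris, method = "class")

# predict.rpart is not among the objects exported by the rpart namespace ...
is_exported <- "predict.rpart" %in% getNamespaceExports("rpart")
print(is_exported)

# ... but it is registered as the predict() method for class "rpart"
method_fun <- getS3method("predict", "rpart")
print(is.function(method_fun))

# Calling the generic therefore works without the ::: operator
pred <- predict(fit, newdata = iris, type = "class")
print(head(pred))
```

Calling `predict()` on an object of class rpart reaches the same method as `rpart:::predict.rpart`, without depending on an unexported name.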
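Don't use predict.rpart with train$finalModel unless you have a very good reason. The rpart object knows nothing about anything train did, including pre-processing, so it may not give you the right answer. After all, you are probably using train to avoid the details, so let predict.train do the work.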

Max
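A minimal sketch of that advice, which is not part of the original answer: the data frame `toy`, its columns and the 5-fold resampling are assumptions chosen to keep the example small. The model is fitted with a formula and a factor predictor, and prediction goes through the train object itself rather than through finalModel.

```r
library(caret)

set.seed(1234)
n <- 200

# Hypothetical data: a factor predictor, a numeric predictor, a two-class outcome
toy <- data.frame(
  sect_risq = factor(sample(0:3, n, replace = TRUE)),
  SOLV      = runif(n)
)
score <- toy$SOLV + 0.3 * (toy$sect_risq == "3") + rnorm(n, sd = 0.2)
toy$def <- factor(ifelse(score > 0.6, "Yes", "No"))

idx       <- createDataPartition(toy$def, p = 0.75, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

fit <- train(def ~ ., data = toy_train, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Predict with the train object: the raw test data frame is accepted as is
pred_class <- predict(fit, newdata = toy_test)
pred_prob  <- predict(fit, newdata = toy_test, type = "prob")

print(head(pred_class))
print(head(pred_prob))
```

Because the train object keeps track of how the predictors were built, the test set can be passed in its original form, factors included.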

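Edit:

Regarding type = "class" and type = "prob":

predict.rpart produces class probabilities by default. rpart is one of the earliest packages, yet this is atypical, since most packages produce classes by default.

predict.train produces classes by default, and you must use type = "prob" to get probabilities.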
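The difference in defaults can be checked directly on an rpart object. This is a small sketch on the built-in iris data, not on the data from the question; it only contrasts the default output of predict on a classification tree with type = "class".

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# No type given: a classification tree returns a matrix of class probabilities
p_default <- predict(fit, newdata = iris)
print(dim(p_default))

# type = "class" returns the predicted class labels as a factor
p_class <- predict(fit, newdata = iris, type = "class")
print(levels(p_class))
```

With a train object the defaults are the other way round, as described above: classes unless type = "prob" is requested.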

predict(rpartTune$finalModel, newdata, type = "class")