R: How to detect overfitting in xgboost (from test AUC scores)


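I am trying to understand how to build predictive models. I recently came across the xgboost package in R and tried to apply it to the Titanic dataset. I have built a model, and now I would like to know how to detect whether it is overfitting, how many rounds to choose, and whether that choice should be based on the training error or the test error.

The code is as follows: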

#Load the packages used below (caret provides dummyVars)
library(caret)
library(xgboost)

#Load Dataset
titanic.train <- read.csv("D:/Data/titanic/train.csv")
titanic.test <- read.csv("D:/Data/titanic/test.csv")
PassengerId=titanic.test$PassengerId
head(titanic.train)

#Create columns to distinguish between Train and Test datasets
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE

#Create a missing column for Test data
titanic.test$Survived <- NA

#Combine Test and Train Datasets
titanic.full <- rbind(titanic.train , titanic.test)
tail(titanic.full)

titanic.full$Name <- as.character(titanic.full$Name)

titanic.full$Title <- sapply(titanic.full$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic.full$Title <- sub(' ','',titanic.full$Title)

titanic.full$Title[titanic.full$Title %in% c('Capt', 'Col' , 'Dr' , 'Don', 'Major', 'Sir' , 'Rev' ,
                                             'Dona', 'Lady', 'the Countess' , 'Jonkheer', 'Master')] <- 'Noble'


titanic.full$Title[titanic.full$Title %in% c('Ms', 'Miss' , 'Mlle')] <- 'Miss'
titanic.full$Title[titanic.full$Title %in% c('Mrs' , 'Mme')] <- 'Mrs'
table(titanic.full$Title) 

#Family size 3 and greater are TRUE or 1
titanic.full$Family <- titanic.full$SibSp + titanic.full$Parch + 1
table(titanic.full$Family)
#titanic.full$Family <- titanic.full$Family >= 3
#titanic.full$Family <- as.factor(titanic.full$Family)
#levels(titanic.full$Family) <- c(0,1)
#titanic.full$Family


titanic.full <- titanic.full[c( "Pclass" , "Title" , "Sex" , "Age" , "Family"  , "Fare", "SibSp" , "Parch"  , "Embarked"  , "Survived")]
head(titanic.full)




#Categorical Casting
titanic.full$Title <- as.factor(titanic.full$Title)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)


titanicDummy <- dummyVars("~.",data=titanic.full, fullRank=T)
titanic.full <- as.data.frame(predict(titanicDummy,titanic.full))
print(names(titanic.full))



#Create test and train data sets
titanic.train <- titanic.full[1:891,]
titanic.test <- titanic.full[892:1309,]

#XGBoosting

set.seed(35)
labs <- titanic.train$Survived
names(titanic.full)
dat <- titanic.train[c("Pclass","Title.Mr","Title.Mrs","Title.Noble", "Sex.male","Age", "Family", "Fare", "SibSp","Parch","Embarked.C","Embarked.Q","Embarked.S")]
titdata <- xgb.DMatrix(data = as.matrix(dat), missing = NA, label=as.numeric(labs))
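#The labels are already stored in titdata, so they are not passed again;
#eval_metric = "auc" makes xgb.cv report the AUC for every round
res <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.1, max_depth = 3, eval_metric = "auc"),
              data = titdata, nrounds = 200, nfold = 10, prediction = TRUE)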
In general (regardless of the specific algorithm you use), overfitting is detected as follows:

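1) Split the dataset into a training set and a test set (for example, 90% training, 10% test).

2) Train the classifier on the training set for a certain number of iterations (or, if you are tuning other parameters rather than the number of iterations, with a particular set of hyperparameter values).

3) Apply the trained classifier to the test set and compute its score (F1, AUC, or plain accuracy, whichever you prefer).

4) Repeat steps 2 and 3 with more iterations until the metric from step 3 starts to decrease compared with the previous pass.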
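The stopping rule in step 4 can be sketched in base R. This is only an illustration under assumptions: `find_best_round` and `patience` are hypothetical names, and the scores are made-up numbers, not results from the Titanic model. With the xgboost R package, per-round scores of this kind are, as far as I know, available in `res$evaluation_log` after `xgb.cv` (for example a `test_auc_mean` column), and `xgb.cv` also accepts an `early_stopping_rounds` argument that applies a similar rule automatically.

```r
# Hypothetical helper: given per-round scores on the held-out data
# (higher is better, e.g. AUC), return the round with the best score,
# stopping the search once `patience` rounds pass without improvement.
find_best_round <- function(valid_score, patience = 10) {
  best_round <- 1
  for (i in seq_along(valid_score)) {
    if (valid_score[i] > valid_score[best_round]) {
      best_round <- i
    } else if (i - best_round >= patience) {
      break  # no improvement for `patience` rounds: stop searching
    }
  }
  best_round
}

# Made-up scores for illustration only: the training AUC keeps rising
# while the held-out AUC peaks and then declines, the usual overfitting pattern.
train_auc <- c(0.80, 0.85, 0.88, 0.91, 0.93, 0.95, 0.97, 0.98)
valid_auc <- c(0.78, 0.82, 0.84, 0.85, 0.84, 0.83, 0.82, 0.81)

best <- find_best_round(valid_auc, patience = 3)
gap  <- train_auc[best] - valid_auc[best]
cat("best round:", best, " train/validation gap:", gap, "\n")
```

The number of rounds is chosen from the held-out score, never from the training score, because the training score normally keeps improving as rounds are added.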


In your case, you did not split the dataset into training and test sets, so I don't think you can tell whether you are actually overfitting.

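Thank you very much for your reply. I understand what you are saying. I split the training set into 90% training and 10% validation. I used the same xgb model as above, tuned some of the parameters, predicted on the validation set and computed the F1 score, which came out at 0.862069.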
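A 90/10 split and an F1 calculation like the one described in this comment can be sketched in base R as follows. The helper `f1_score`, the 0.5 threshold and the example labels and probabilities are assumptions made for illustration; they are not the commenter's code or data.

```r
# Hypothetical helper: F1 score for binary labels coded as 0/1
f1_score <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  if (2 * tp + fp + fn == 0) return(NA_real_)  # undefined without positives
  2 * tp / (2 * tp + fp + fn)
}

# 90/10 split of the 891 training rows into training and validation indices
set.seed(35)
n <- 891
train_idx <- sample(n, size = floor(0.9 * n))
valid_idx <- setdiff(seq_len(n), train_idx)

# Made-up validation labels and predicted probabilities, for illustration only
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob      <- c(0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3)
predicted <- as.integer(prob > 0.5)  # assumed classification threshold of 0.5

f1 <- f1_score(actual, predicted)
cat("validation F1:", f1, "\n")
```

A single validation score does not show overfitting by itself. Computing the same metric on the training rows and comparing the two is what reveals a gap.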