R: How to avoid the error "factor has new levels" in cross-validated glm?


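My goal is to use cross-validation to evaluate the performance of a linear model.

My problem is that my training and test sets may not always contain the same factor levels.

Here is a reproducible data example: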

set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))

data <- data.frame(x,y,z)

summary(data)

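If you do not get the "factor z has new levels" error on the first run, rerun the cross-validation; at some point you will get a similar error.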

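The essence of the problem is that, during cross-validation, the training and test subsets may not contain exactly the same factor levels. Here the variable z has three levels (D, E, F), and in the full data D is far more frequent than E and F (997 rows versus 2 and 1). So whenever you take a subset of the data for cross-validation, there is a good chance that every z value in it is at level D. The E and F levels are then dropped from the fitted model, and predicting on rows that do contain them raises the error (this answer helps in understanding the problem: )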
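The effect described above can be checked directly, without fitting any model. The following is a minimal sketch that assigns the 1000 rows to 10 random folds and lists the levels of z that appear in a held-out fold but not in the corresponding training rows. The names k, fold_id, unseen_levels and problem_folds are made up for this illustration, and the fold assignment is a plain random split, not necessarily the exact one cv.glm draws internally.

```r
set.seed(1)
z <- rep(x = c("D", "E", "F"), times = c(997, 2, 1))

k <- 10  # number of folds, matching K = 10 in the cv.glm call
# fold_id is an illustrative vector: one random fold label per row
fold_id <- sample(rep(seq_len(k), length.out = length(z)))

# For each fold: levels present in the held-out rows but absent from
# the rows that would be used for fitting
unseen_levels <- lapply(seq_len(k), function(i) {
  setdiff(unique(z[fold_id == i]), unique(z[fold_id != i]))
})

# Folds whose held-out rows contain a level the fitted model never saw
problem_folds <- which(lengths(unseen_levels) > 0)
print(unseen_levels[problem_folds])
```

Because F occurs in a single row, the fold that holds that row leaves a training subset with no F at all, so at least one fold is affected under any random split of this data.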

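My question is: how can I avoid the levels being dropped in the first place?

If that is not possible, what are the alternatives?

(Keep in mind that this is a reproducible example. The real data I am working with has many variables like z, and I would like to avoid removing them.)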

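To answer the question in your comment: I do not know whether a ready-made function exists. There most likely is one, but I do not know which package would provide it. For this example, the following function should work: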

set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)

#optional tag row for later identification: 
#data$rowid<-1:nrow(data)

stratified <- function(df, column, percent){
  #split dataframe into groups based on column
  listdf<-split(df, df[[column]])
  testsubgroups<-lapply(listdf, function(x){
    #pick the number of samples per group, round up.
    numsamples <- ceiling(percent*nrow(x))
    #selects the rows
    whichones <-sample(1:nrow(x), numsamples, replace = FALSE)
    testsubgroup <-x[whichones,] 
  })  
  #combine the subgroups into one data frame
  testgroup<-do.call(rbind, testsubgroups)
  testgroup
}

testgroup<-stratified(data, "z", 0.8)
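One way the stratified sample might be used is as the training set, with the remaining rows held out for testing. The sketch below is an assumption about usage, not part of the answer: stratified_rows is a hypothetical variant of the function above that returns row numbers, so the complement is easy to take, and the model formula x ~ y + z is assumed. It shows a single hold-out split, not a full K-fold cross-validation.

```r
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A", "B"), times = c(500, 500))
z <- rep(x = c("D", "E", "F"), times = c(997, 2, 1))
data <- data.frame(x, y, z)

# Hypothetical helper (not from any package): same idea as stratified(),
# but it returns row numbers instead of the rows themselves.
stratified_rows <- function(df, column, percent) {
  groups <- split(seq_len(nrow(df)), df[[column]])
  picked <- lapply(groups, function(idx) {
    n_pick <- ceiling(percent * length(idx))
    # Index into idx rather than calling sample(idx, ...): sample() on a
    # single number n would draw from 1:n instead of returning n.
    idx[sample.int(length(idx), n_pick)]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_rows <- stratified_rows(data, "z", 0.8)
train <- data[train_rows, ]
test <- data[-train_rows, ]

# Every level of z in the test rows also occurs in the training rows
stopifnot(all(unique(test$z) %in% unique(train$z)))

model_glm <- glm(x ~ y + z, data = train)
pred <- predict(model_glm, newdata = test, type = "response")
mean((test$x - pred)^2)  # mean squared error on the held-out rows
```

Since the number of rows per level is rounded up, every level contributes at least one row to the training set, so predict() does not meet an unseen level. The trade-off is that very rare levels (E and F here) end up only in the training rows and are never tested.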

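Comments:

You could try stratified sampling instead of randomly sampling the initial data set. Split the initial set by the factor variable (z here), randomly sample each subset, then combine the sampled subsets into the training data set.

@DaveT Do you know whether there is a function available to do this?

The cross-validation call from the question that raises the error:

library(boot)
# Assumption: model_glm is a glm of x on all other columns (its definition is not shown)
model_glm <- glm(x ~ ., data = data)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)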

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor z has new levels F
