R “如何调试”;对比度仅适用于具有两个或两个以上级别的因素”;错误?

R “如何调试”;对比度仅适用于具有两个或两个以上级别的因素”;错误?,r,regression,lm,glm,r-faq,R,Regression,Lm,Glm,R Faq,以下是我正在处理的所有变量: str(ad.train) $ Date : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ... $ Team : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ... $ Season

以下是我正在处理的所有变量:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...
以下是我试图适应的glm:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))
ad.glm.all简介
。但实际上,这个简单的事实很容易被掩盖,因为实际用于模型拟合的数据可能与您传递的数据非常不同。当您的数据中有
NA
时,会发生这种情况,您已对数据进行了子集划分,某个因子有未使用的级别,或者您已转换变量并在某个地方得到
NaN
。关于StackOverflow这个错误的许多问题是不可复制的,因此人们的建议可能有效,也可能无效。因此,尽管到目前为止对这个问题还存在争议,但用户仍然无法找到一个自适应的解决方案,所以这个问题一再被提出。这个答案是我的尝试,解决这个问题“一劳永逸”,或者至少提供一个合理的指导

这个答案有丰富的信息,所以让我先做一个简短的总结

我为您定义了3个帮助函数:
debug\u-contr\u-error
debug\u-contr\u-error 2
NA\u-preproc

我建议您按以下方式使用它们

  • 运行
    NA_preproc
    以获得更完整的案例
  • 运行模型,如果出现“对比度错误”,请使用
    debug\u contr\u error 2
    进行调试
  • 大部分答案都会一步一步地告诉您如何定义这些函数以及为什么定义这些函数。跳过这些开发过程可能没有坏处,但不要跳过“可复制的案例研究和讨论”中的部分


    订正答复 和。但由于缺乏适应能力。查看问题中
    str(ad.train)
    的输出。OP的变量为数值或因子;没有字符。最初的答案是针对这种情况。如果您有字符变量,尽管在
    lm
    glm
    拟合期间它们将被强制为因子,但代码不会报告它们,因为它们不是作为因子提供的,所以
    是。因子将丢失它们。在此扩展中,我将使原始答案更具适应性

    将数据集
    dat
    传递给
    lm
    glm
    。如果您没有这样一个数据框架,也就是说,您的所有变量都分散在全局环境中,那么您需要将它们收集到一个数据框架中。下面的方法可能不是最好的方法,但很有效

    ## `form` is your model formula, here is an example
    y <- x1 <- x2 <- x3 <- 1:4
    x4 <- matrix(1:8, 4)
    form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4
    
    ## to gather variables `model.frame.default(form)` is the easiest way 
    ## but it does too much: it drops `NA` and transforms variables
    ## we want something more primitive
    
    ## first get variable names
    vn <- all.vars(form)
    #[1] "y"  "x1" "x2" "x3" "x4"
    
    ## `get_all_vars(form)` gets you a data frame
    ## but it is buggy for matrix variables so don't use it
    ## instead, first use `mget` to gather variables into a list
    lst <- mget(vn)
    
    ## don't do `data.frame(lst)`; it is buggy with matrix variables
    ## need to first protect matrix variables by `I()` then do `data.frame`
    lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
    dat <- data.frame(lst_protect)
    str(dat)
    #'data.frame':  4 obs. of  5 variables:
    # $ y : int  1 2 3 4
    # $ x1: int  1 2 3 4
    # $ x2: int  1 2 3 4
    # $ x3: int  1 2 3 4
    # $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8
    
    ## note the 'AsIs' for matrix variable `x4`
    ## in comparison, try the following buggy ones yourself
    str(get_all_vars(form))
    str(data.frame(lst))
    
    步骤1:删除不完整的案例

    dat <- na.omit(dat)
    
    逻辑变量很复杂。它可以被视为一个虚拟变量(
    1
    表示
    TRUE
    0
    表示
    FALSE
    ),因此是一个“数字”,也可以强制为两级因子。这完全取决于
    model.matrix
    是否认为模型公式规范中的“to factor”强制是必要的。为了简单起见,我们可以这样理解它:它总是被强制为一个因子,但应用对比的结果可能最终得到相同的模型矩阵,就好像它被直接作为虚拟处理一样

    有些人可能想知道为什么不包括“整数”。因为整数向量(如
    1:4
    )具有“数值”模式(请尝试
    模式(1:4)

    数据帧列也可以是具有“AsIs”类的矩阵,但此类矩阵必须具有“数值”模式

    我们的检查是在以下情况下产生错误:

    • 发现“复杂”或“原始”
    • 找到“逻辑”或“字符”矩阵变量
    并继续将“factor”类的“logical”和“character”转换为“numeric”

    步骤4:总结因素变量

    dat <- na.omit(dat)
    
    现在我们准备好看看
    lm
    glm
    实际使用了什么和多少因子水平:

    ## export factor levels actually used by `lm` and `glm`
    lev <- lapply(dat[fctr], levels)
    
    ## count number of levels
    nl <- lengths(lev)
    
    下面是一个小例子

    dat <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c("A", "A", "A", "B"),
                      stringsAsFactors = FALSE)
    
    #  y  x f1 f2
    #1 1  1  a  A
    #2 2  2  a  A
    #3 3  3  b  A
    #4 4 NA  b  B
    
    str(dat)
    #'data.frame':  4 obs. of  4 variables:
    # $ y : int  1 2 3 4
    # $ x : int  1 2 3 NA
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
    # $ f2: chr  "A" "A" "A" "B"
    
    lm(y ~ x + f1 + f2, dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    
    请注意,原始简短答案在这里是没有希望的,因为
    f2
    是作为字符变量而不是因子变量提供的

    ## old answer
    tmp <- na.omit(dat)
    fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
    sapply(fctr, nlevels)
    #f1 
    # 2 
    rm(tmp, fctr)
    
    dat <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c(TRUE, TRUE, TRUE, FALSE))
    
    dat
    #  y  x f1    f2
    #1 1  1  a  TRUE
    #2 2  2  a  TRUE
    #3 3  3  b  TRUE
    #4 4 NA  b FALSE
    
    str(dat)
    #'data.frame':  4 obs. of  4 variables:
    # $ y : int  1 2 3 4
    # $ x : int  1 2 3 NA
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
    # $ f2: logi  TRUE TRUE TRUE FALSE
    
    dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")
    
    dat_internal
    #  y x f1 f2
    #1 1 1  a  A
    #2 2 2  a  A
    #3 3 3  b  A
    
    str(dat_internal)
    #'data.frame':  3 obs. of  4 variables:
    # $ y : int  1 2 3
    # $ x : int  1 2 3
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2
    # $ f2: chr  "A" "A" "A"
    ## [.."terms" attribute is truncated..]
    
    请注意,没有级别的因子变量也会导致“对比度错误”。您可能想知道0级因子是如何可能的。这是合法的:
    nlevels(factor(character(0))
    。在这里,如果没有完整的案例,您将以0级因子结束

    dat <- data.frame(y = 1:4,
                      x = rep(NA_real_, 4),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c("A", "A", "A", "B"),
                      stringsAsFactors = FALSE)
    
    lm(y ~ x + f1 + f2, dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    
    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 0  0    ## all values are 0
    #Warning message:
    #In debug_contr_error(dat) : no complete cases
    
    我们的调试器将预测“对比度错误”,但它真的会发生吗

    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 2  1 
    
    不,至少这个没有失败():

    我很难举一个错误的例子,但也没有必要。实际上,我们不使用调试器进行预测;当我们真的遇到错误时,我们使用它;在这种情况下,调试器可以定位有问题的因子变量

    ## old answer
    tmp <- na.omit(dat)
    fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
    sapply(fctr, nlevels)
    #f1 
    # 2 
    rm(tmp, fctr)
    
    dat <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c(TRUE, TRUE, TRUE, FALSE))
    
    dat
    #  y  x f1    f2
    #1 1  1  a  TRUE
    #2 2  2  a  TRUE
    #3 3  3  b  TRUE
    #4 4 NA  b FALSE
    
    str(dat)
    #'data.frame':  4 obs. of  4 variables:
    # $ y : int  1 2 3 4
    # $ x : int  1 2 3 NA
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
    # $ f2: logi  TRUE TRUE TRUE FALSE
    
    dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")
    
    dat_internal
    #  y x f1 f2
    #1 1 1  a  A
    #2 2 2  a  A
    #3 3 3  b  A
    
    str(dat_internal)
    #'data.frame':  3 obs. of  4 variables:
    # $ y : int  1 2 3
    # $ x : int  1 2 3
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2
    # $ f2: chr  "A" "A" "A"
    ## [.."terms" attribute is truncated..]
    
    也许有人会说逻辑变量和虚拟变量没有什么不同。但请尝试下面的简单示例:这取决于您的公式

    u <- c(TRUE, TRUE, FALSE, FALSE)
    v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`
    
    model.matrix(~ u)
    #  (Intercept) uTRUE
    #1           1     1
    #2           1     1
    #3           1     0
    #4           1     0
    
    model.matrix(~ v)
    #  (Intercept) v
    #1           1 1
    #2           1 1
    #3           1 0
    #4           1 0
    
    model.matrix(~ u - 1)
    #  uFALSE uTRUE
    #1      0     1
    #2      0     1
    #3      1     0
    #4      1     0
    
    model.matrix(~ v - 1)
    #  v
    #1 1
    #2 1
    #3 0
    #4 0
    
    实际上,
    model.frame
    只执行步骤0和步骤1。它还会删除数据集中提供的变量,但不会删除模型公式中提供的变量。因此,模型框架的行数和列数可能比您输入的
    lm
    glm
    更少。我们在步骤2中完成的类型强制是由后面的
    model.matrix
    完成的,其中可能会产生“对比度错误”

    首先获取该内部模型框架,然后将其传递给
    调试控制错误
    (这样它基本上只执行步骤2到4),这有一些好处

    优点1:忽略模型公式中未使用的变量

    ## no variable `f1` in formula
    dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")
    
    ## compare the following
    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 2  1 
    
    debug_contr_error(dat_internal)$nlevels
    #f2 
    # 1 
    
    考虑到这些好处,我编写了另一个函数来包装
    model.frame
    debug\u control\u error

    输入

    • 表单
      是您的模型公式
    • dat
      是通过
      data
      参数传递给
      lm
      glm
      的数据集
    • subset_vec
      是通过
      subset
      参数传递给
      lm
      glm
      的索引向量
    输出:带有

    • <代码
      dat <- data.frame(y = 1:4,
                        x = rep(NA_real_, 4),
                        f1 = gl(2, 2, labels = letters[1:2]),
                        f2 = c("A", "A", "A", "B"),
                        stringsAsFactors = FALSE)
      
      lm(y ~ x + f1 + f2, dat)
      #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
      #  contrasts can be applied only to factors with 2 or more levels
      
      debug_contr_error(dat)$nlevels
      #f1 f2 
      # 0  0    ## all values are 0
      #Warning message:
      #In debug_contr_error(dat) : no complete cases
      
      dat <- data.frame(y = 1:4,
                        x = c(1:3, NA),
                        f1 = gl(2, 2, labels = letters[1:2]),
                        f2 = c(TRUE, TRUE, TRUE, FALSE))
      
      dat
      #  y  x f1    f2
      #1 1  1  a  TRUE
      #2 2  2  a  TRUE
      #3 3  3  b  TRUE
      #4 4 NA  b FALSE
      
      str(dat)
      #'data.frame':  4 obs. of  4 variables:
      # $ y : int  1 2 3 4
      # $ x : int  1 2 3 NA
      # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
      # $ f2: logi  TRUE TRUE TRUE FALSE
      
      debug_contr_error(dat)$nlevels
      #f1 f2 
      # 2  1 
      
      lm(y ~ x + f1 + f2, data = dat)
      #Coefficients:
      #(Intercept)            x          f1b       f2TRUE  
      #          0            1            0           NA
      
      u <- c(TRUE, TRUE, FALSE, FALSE)
      v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`
      
      model.matrix(~ u)
      #  (Intercept) uTRUE
      #1           1     1
      #2           1     1
      #3           1     0
      #4           1     0
      
      model.matrix(~ v)
      #  (Intercept) v
      #1           1 1
      #2           1 1
      #3           1 0
      #4           1 0
      
      model.matrix(~ u - 1)
      #  uFALSE uTRUE
      #1      0     1
      #2      0     1
      #3      1     0
      #4      1     0
      
      model.matrix(~ v - 1)
      #  v
      #1 1
      #2 1
      #3 0
      #4 0
      
      dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")
      
      dat_internal
      #  y x f1 f2
      #1 1 1  a  A
      #2 2 2  a  A
      #3 3 3  b  A
      
      str(dat_internal)
      #'data.frame':  3 obs. of  4 variables:
      # $ y : int  1 2 3
      # $ x : int  1 2 3
      # $ f1: Factor w/ 2 levels "a","b": 1 1 2
      # $ f2: chr  "A" "A" "A"
      ## [.."terms" attribute is truncated..]
      
      ## no variable `f1` in formula
      dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")
      
      ## compare the following
      debug_contr_error(dat)$nlevels
      #f1 f2 
      # 2  1 
      
      debug_contr_error(dat_internal)$nlevels
      #f2 
      # 1 
      
      dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
      #  y  x f
      #1 1  1 a
      #2 2  2 a
      #3 3  3 a
      #4 4 -1 b
      
      lm(y ~ log(x) + f, data = dat)
      #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
      #  contrasts can be applied only to factors with 2 or more levels
      #In addition: Warning message:
      #In log(x) : NaNs produced
      
      # directly using `debug_contr_error` is hopeless here
      debug_contr_error(dat)$nlevels
      #f 
      #2 
      
      ## this works
      dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
      #  y    log(x) f
      #1 1 0.0000000 a
      #2 2 0.6931472 a
      #3 3 1.0986123 a
      
      debug_contr_error(dat_internal)$nlevels
      #f 
      #1
      
      ## note: this function relies on `debug_contr_error`
      debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
        ## step 0
        if (!is.null(subset_vec)) {
          if (mode(subset_vec) == "logical") {
            if (length(subset_vec) != nrow(dat)) {
              stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
              }
            subset_log_vec <- subset_vec
            } else if (mode(subset_vec) == "numeric") {
            ## check range
            ran <- range(subset_vec)
            if (ran[1] < 1 || ran[2] > nrow(dat)) {
              stop("'numeric' `subset_vec` provided but values are out of bound")
              } else {
              subset_log_vec <- logical(nrow(dat))
              subset_log_vec[as.integer(subset_vec)] <- TRUE
              } 
            } else {
            stop("`subset_vec` must be either 'logical' or 'numeric'")
            }
          dat <- base::subset(dat, subset = subset_log_vec)
          }
        ## step 0 and 1
        dat_internal <- stats::lm(form, data = dat, method = "model.frame")
        attr(dat_internal, "terms") <- NULL
        ## rely on `debug_contr_error` for steps 2 to 4
        c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
        }
      
      debug_contr_error2(y ~ log(x) + f, dat)
      #$mf
      #  y    log(x) f
      #1 1 0.0000000 a
      #2 2 0.6931472 a
      #3 3 1.0986123 a
      #
      #$nlevels
      #f 
      #1 
      #
      #$levels
      #$levels$f
      #[1] "a"
      #
      #
      #Warning message:
      #In log(x) : NaNs produced
      
      ## or: debug_contr_error2(y ~ log(x) + f, dat, c(T, F, T, T))
      debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
      #$mf
      #  y   log(x) f
      #1 1 0.000000 a
      #3 3 1.098612 a
      #
      #$nlevels
      #f 
      #1 
      #
      #$levels
      #$levels$f
      #[1] "a"
      #
      #
      #Warning message:
      #In log(x) : NaNs produced
      
      ## x is a factor with NA
      
      x <- factor(c(letters[1:4], NA))  ## default: `exclude = NA`
      #[1] a    b    c    d    <NA>     ## there is an NA value
      #Levels: a b c d                  ## but NA is not a level
      
      na.omit(x)  ## NA is gone
      #[1] a b c d
      #[.. attributes truncated..]
      #Levels: a b c d
      
      x <- addNA(x)  ## now add NA into a valid level
      #[1] a    b    c    d    <NA>
      #Levels: a b c d <NA>  ## it appears here
      
      droplevels(x)    ## it can not be dropped
      #[1] a    b    c    d    <NA>
      #Levels: a b c d <NA>
      
      na.omit(x)  ## it is not omitted
      #[1] a    b    c    d    <NA>
      #Levels: a b c d <NA>
      
      model.matrix(~ x)   ## and it is valid to be in a design matrix
      #  (Intercept) xb xc xd xNA
      #1           1  0  0  0   0
      #2           1  1  0  0   0
      #3           1  0  1  0   0
      #4           1  0  0  1   0
      #5           1  0  0  0   1
      
      ## x is a character with NA
      
      x <- c(letters[1:4], NA)
      #[1] "a" "b" "c" "d" NA 
      
      as.factor(x)  ## this calls `factor(x)` with default `exclude = NA`
      #[1] a    b    c    d    <NA>     ## there is an NA value
      #Levels: a b c d                  ## but NA is not a level
      
      factor(x, exclude = NULL)      ## we want `exclude = NULL`
      #[1] a    b    c    d    <NA>
      #Levels: a b c d <NA>          ## now NA is a level
      
      NA_preproc <- function (dat) {
        for (j in 1:ncol(dat)) {
          x <- dat[[j]]
          if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
          if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
          }
        dat
        }
      
      df %>% dplyr::mutate_all(as.factor) %>% str
      
      train.df = read.csv('train.csv')
      lm1 = lm(SalePrice ~ ., data = train.df)
      
      cols = colnames(train.df)
      for (col in cols){
        if(is.factor(train.df[[col]])){
          cat(col, ' has ', length(levels(train.df[[col]])), '\n')
        }
      }
      
      train.df = subset(train.df, select=-c(Id, PoolQC,Fence, MiscFeature, Alley, Utilities))
      lm1 = lm(SalePrice ~ ., data = train.df)
      
      fill.na.with.mode = function(df){
          cols = colnames(df)
          for (col in cols){
              if(class(df[[col]])=='factor'){
                  x = summary(df[[col]])
                  mode = names(x[which.max(x)])
                  df[[col]][is.na(df[[col]])]=mode
              }
              else{
                  df[[col]][is.na(df[[col]])]=0
              }
          }
          return (df)
      }