Reduce the number of levels for each factor: dplyr approach

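I am trying to reduce the number of levels of every factor variable in my data. I want to do this with two operations:

  • If the number of levels exceeds a cut-off, replace the least frequent levels with a new level until the number of levels reaches the cut-off
  • Replace levels that have too few observations with a new level

I wrote a function that works fine, but I don't like its code. It does not matter if the level REMAIN itself does not have enough observations. I would prefer a dplyr approach.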

    library(data.table)

    ReplaceFactor <- function(data, max_levels, min_values_factor){
        # First make sure that there are not too many levels in a factor
        for(i in colnames(data)){
            if(is.factor(data[[i]])){
                if(length(levels(data[[i]])) > max_levels){
                    # Keep the (max_levels - 1) most frequent levels
                    levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1 : (max_levels - 1)]
                    data[!get(i) %in% levels_keep, (i) := "REMAIN"]
                    data[[i]] <- as.factor(as.character(data[[i]]))
                }
            }
        }
        # Now make sure that each level has enough observations
        for(i in colnames(data)){
            if(is.factor(data[[i]])){
                if(min(table(data[[i]])) < min_values_factor){
                    levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
                    data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
                    data[[i]] <- as.factor(as.character(data[[i]]))
                }
            }
        }
        return(data)
    }
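    # stringsAsFactors = TRUE is needed on R >= 4.0 to get the factor columns shown below
    df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"), 
                     B = 1:9, 
                     C = c("A","A","B","B","C","C","C","D","D"), 
                     D = c("A","B","E", "E", "E","E","E", "E", "E"),
                     stringsAsFactors = TRUE)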
    str(df)
    'data.frame':   9 obs. of  4 variables:
     $ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
     $ B: int  1 2 3 4 5 6 7 8 9
     $ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
     $ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
    
    dt2 <- ReplaceFactor(data = data.table(df),
                  max_levels = 3,
                  min_values_factor = 2)
    str(dt2)
    Classes ‘data.table’ and 'data.frame':  9 obs. of  4 variables:
     $ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
     $ B: int  1 2 3 4 5 6 7 8 9
     $ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
     $ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
     - attr(*, ".internal.selfref")=<externalptr>
     dt2
       A B      C      D
    1: A 1      A REMAIN
    2: A 2      A REMAIN
    3: B 3 REMAIN      E
    4: B 4 REMAIN      E
    5: C 5      C      E
    6: C 6      C      E
    7: C 7      C      E
    8: C 8 REMAIN      E
    9: C 9 REMAIN      E
    
Using forcats:

    library(dplyr)
    library(forcats)
    
    max_levels <- 3
    min_values_factor <- 2
    df %>% 
      mutate_if(is.factor, fct_lump, n = max_levels, 
                other_level = "REMAIN", ties.method = "first") %>% 
      mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.), 
                other_level = "REMAIN")
    
    #   A B      C      D
    # 1 A 1      A REMAIN
    # 2 A 2      A REMAIN
    # 3 B 3      B      E
    # 4 B 4      B      E
    # 5 C 5      C      E
    # 6 C 6      C      E
    # 7 C 7      C      E
    # 8 C 8 REMAIN      E
    # 9 C 9 REMAIN      E
    

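(Oh, and I could not reproduce the exact behavior of your function, but by tweaking ties.method and subtracting 1 from max_levels, you can probably get what you want.)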
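If the exact behavior of the original function matters, one option is to spell out its two rules in a small helper and apply it with dplyr's across(). This is only a sketch and not the forcats approach: lump_levels, min_n and other are hypothetical names introduced here, and ties between equally frequent levels are assumed to be broken by level order.

```r
library(dplyr)

# Hypothetical helper (not from the original posts): lump rare levels of one factor.
lump_levels <- function(f, max_levels, min_n, other = "REMAIN") {
  counts <- table(f)
  # Levels ordered from most to least frequent; order() keeps ties in level order.
  keep <- names(counts)[order(-as.vector(counts))]
  # Rule 1: with too many levels, keep only the (max_levels - 1) most frequent.
  if (length(keep) > max_levels) {
    keep <- keep[seq_len(max_levels - 1)]
  }
  # Rule 2: drop kept levels that have fewer than min_n observations.
  keep <- keep[as.vector(counts[keep]) >= min_n]
  x <- as.character(f)
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x)
}

# Example data from the question.
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

out <- df %>%
  mutate(across(where(is.factor),
                function(f) lump_levels(f, max_levels = 3, min_n = 2)))
```

On the example data, C ends up with the levels A, C and REMAIN, and D with E and REMAIN, which matches the str(dt2) output in the question.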

I suggest you take a look at the forcats package, which has nice functions for this kind of task: fct_lump, for example, might help.
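As a rough illustration of what such functions do, here is a minimal sketch on a made-up factor f. It assumes a forcats version that provides the more specific variants fct_lump_n() and fct_lump_min(); the data are invented for the example.

```r
library(forcats)

# Made-up factor: a appears 4 times, b 3 times, c twice, d once.
f <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"))

# Keep the 2 most frequent levels and lump the rest together.
top2 <- fct_lump_n(f, n = 2, other_level = "REMAIN")

# Lump every level that appears fewer than 3 times.
min3 <- fct_lump_min(f, min = 3, other_level = "REMAIN")

levels(top2)
levels(min3)
```

In both calls c and d are merged into REMAIN, the first time because of their rank and the second time because of their counts.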