Why use as.factor() instead of factor()?

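I recently saw Matt Dowle write some code using as.factor():

for (col in names_factors) set(dt, j=col, value=as.factor(dt[[col]]))

I used this snippet, but I needed to set the factor levels explicitly so that they appear in the order I want, so I had to replace as.factor(dt[[col]]) with a factor() call that specifies the levels.

This got me thinking: what benefit (if any) does as.factor() have over plain factor()?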

as.factor is a wrapper for factor, but it allows a quick return if the input vector is already a factor:
function (x) 
{
    if (is.factor(x)) 
        x
    else if (!is.object(x) && is.integer(x)) {
        levels <- sort(unique.default(x))
        f <- match(x, levels)
        levels(f) <- as.character(levels)
        if (!is.null(nx <- names(x))) 
            names(f) <- nx
        class(f) <- "factor"
        f
    }
    else factor(x)
}
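
A minimal sketch of what the quick return means in practice. The factor `f` and the vector `sizes` are made up for illustration. as.factor hands a factor back untouched, whereas factor rebuilds it; and because as.factor accepts only `x`, an explicit level order (as needed in the question) still requires factor.

```r
# Hypothetical factor with an unused level "c"
f <- factor(c("a", "b"), levels = c("a", "b", "c"))

# as.factor() returns a factor input unchanged, so "c" is kept
levels(as.factor(f))
# [1] "a" "b" "c"

# factor() rebuilds the factor and drops the unused level
levels(factor(f))
# [1] "a" "b"

# as.factor() has no `levels` argument; use factor() to control the order
sizes <- c("small", "large", "medium", "small")
levels(as.factor(sizes))   # sorted order
# [1] "large"  "medium" "small"
levels(factor(sizes, levels = c("small", "medium", "large")))
# [1] "small"  "medium" "large"
```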


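Expanded answer two years later, covering the following:

What does the manual say?
Performance: as.factor vs. factor when the input is a factor
Performance: as.factor vs. factor when the input is an integer
Unused levels or NA levels
Caution when using R's group-by functions: watch for unused or NA levels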


What does the manual say?
The documentation ?factor mentions the following:
‘factor(x, exclude = NULL)’ applied to a factor without ‘NA’s is a
 no-operation unless there are unused levels: in that case, a
 factor with the reduced level set is returned.

 ‘as.factor’ coerces its argument to a factor.  It is an
 abbreviated (sometimes faster) form of ‘factor’.

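Performance: as.factor vs. factor when the input is a factor
The word "no-operation" is a bit ambiguous. Don't read it as "doing nothing"; in fact, it means "doing a lot of things but essentially changing nothing". Here is an example: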
set.seed(0)
## a randomized long factor with 1e+6 levels, each repeated 10 times
f <- sample(gl(1e+6, 10))

system.time(f1 <- factor(f))  ## default: exclude = NA
#   user  system elapsed 
#  7.640   0.216   7.887 

system.time(f2 <- factor(f, exclude = NULL))
#   user  system elapsed 
#  7.764   0.028   7.791 

system.time(f3 <- as.factor(f))
#   user  system elapsed 
#      0       0       0 

identical(f, f1)
#[1] TRUE

identical(f, f2)
#[1] TRUE

identical(f, f3)
#[1] TRUE

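Internally, factor first sorts the unique values of the input vector f, then converts f to a character vector, and finally coerces the character vector back to a factor. Here is the source code of factor for confirmation: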
function (x = character(), levels, labels = levels, exclude = NA, 
    ordered = is.ordered(x), nmax = NA) 
{
    if (is.null(x)) 
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x)) 
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx)) 
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL))) 
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d", 
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL) 
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}
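
The steps in the source above can be traced by hand on a tiny input. This is only a sketch: the vector `x` is made up, and the hand-built factor ignores the `labels`, `exclude`, and `ordered` handling of the real function.

```r
x <- c(10, 2, 2, 33)          # made-up numeric input

y   <- unique(x)              # distinct values: 10 2 33
ind <- sort.list(y)           # ordering of the distinct values
lev <- unique(as.character(y)[ind])
lev
# [1] "2"  "10" "33"

# The input is converted to character and matched against the levels
codes <- match(as.character(x), lev)
codes
# [1] 2 1 1 3

# Assemble the factor by hand and compare it with factor()
f_manual <- structure(codes, levels = lev, class = "factor")
identical(levels(f_manual), levels(factor(x)))
identical(as.character(f_manual), as.character(factor(x)))
```

Note that the levels are sorted while the values are still numeric, which is why "2" comes before "10".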

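Performance: as.factor vs. factor when the input is an integer
A factor is stored internally as an integer vector with a levels attribute. This means that converting an integer vector to a factor is easier than converting a numeric or character vector to a factor, and as.factor has a dedicated branch for exactly this case: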
x <- sample.int(1e+6, 1e+7, TRUE)

system.time(as.factor(x))
#   user  system elapsed 
#  4.592   0.252   4.845 

system.time(factor(x))
#   user  system elapsed 
# 22.236   0.264  22.659 
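
To see why the integer case is cheaper, the integer branch of as.factor quoted at the top of the answer can be written out on a small made-up vector `x`. This is a sketch of that branch only (names are not handled); the point is that `x` is matched directly against its sorted unique values and never converted to character as a whole.

```r
x <- c(3L, 1L, 3L, 2L)       # made-up integer input

# The integer branch of as.factor(), step by step
lev <- sort(unique.default(x))
f   <- match(x, lev)          # integer codes, no as.character(x) needed
levels(f) <- as.character(lev)
class(f)  <- "factor"

levels(f)
# [1] "1" "2" "3"
as.integer(f)
# [1] 3 1 3 2

# Same levels and values as what factor() produces
all(as.character(f) == as.character(factor(x)))
```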

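Unused levels or NA levels
There is a (generic) function droplevels that can be used to drop unused levels of a factor. By default, however, NA levels are not dropped:
## a factor with an explicit NA level (exclude = NULL keeps NA as a level)
f <- factor(c(1, NA), exclude = NULL)

## "factor" method of `droplevels`
droplevels.factor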
#function (x, exclude = if (anyNA(levels(x))) NULL else NA, ...) 
#factor(x, exclude = exclude)

droplevels(f)
#[1] 1    <NA>
#Levels: 1 <NA>

droplevels(f, exclude = NA)
#[1] 1    <NA>
#Levels: 1
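
Because droplevels is generic, it also works on a whole data frame. A small sketch with a made-up data frame `df` whose factor column carries one unused level:

```r
# Hypothetical data frame; level "c" of column g is never used
df <- data.frame(
  g = factor(c("a", "b"), levels = c("a", "b", "c")),
  v = 1:2
)
levels(df$g)
# [1] "a" "b" "c"

# The data frame method applies droplevels to every factor column
df2 <- droplevels(df)
levels(df2$g)
# [1] "a" "b"
```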

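Caution when using R's group-by functions: watch for unused or NA levels
Interestingly, although table does not rely on as.factor, it preserves unused levels as well:
## a factor with an unused level "c"
f <- factor(letters[1:2], levels = letters[1:3])
table(f)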
#a b c 
#1 1 0 

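Sometimes this behavior is undesirable. A classic example is barplot(table(f)), where each unused level still gets a (zero-height) bar.

If this is really not wanted, we need to remove the unused or NA levels from the factor variable manually, using droplevels or factor.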
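
A short sketch of that manual cleanup before tabulating, reusing the example factor with the unused level "c". The variable names `counts_all` and `counts_used` are made up for illustration.

```r
f <- factor(letters[1:2], levels = letters[1:3])

counts_all  <- table(f)               # keeps the unused level "c"
counts_used <- table(droplevels(f))   # table(factor(f)) works as well

names(counts_all)
# [1] "a" "b" "c"
names(counts_used)
# [1] "a" "b"

# Plotting the cleaned table avoids the empty bar:
# barplot(counts_used)
```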
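Hints:
split has an argument drop that defaults to FALSE, in which case as.factor is used; with drop = TRUE, factor is used instead.
aggregate relies on split, so it also has a drop argument, which defaults to TRUE.
tapply has no drop argument, although it also relies on split. In particular, the documentation ?tapply says that as.factor is (always) used.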
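
The three hints can be checked on a small example. This is a sketch with made-up data; `agg` is just an illustrative name.

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])   # "c" is unused

# split(): the default drop = FALSE keeps the unused level
names(split(x, f))
# [1] "a" "b" "c"
names(split(x, f, drop = TRUE))
# [1] "a" "b"

# tapply() has no `drop` argument, so drop the levels beforehand
tapply(x, droplevels(f), FUN = mean)
# a b 
# 1 2 

# aggregate() leaves out empty groups by default
agg <- aggregate(x, by = list(f = f), FUN = mean)
nrow(agg)
# [1] 2
```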
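
Naming consistency is a big deal. Nearly all of the basic classes have an as.class function.
The source code of factor shown in the answer is from R 3.4.4. The source code has changed considerably since R 3.5.0, but all the conclusions in the answer remain valid.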

The integer storage behind a factor can be inspected directly:
unclass(gl(2, 2, labels = letters[1:2]))
#[1] 1 1 2 2
#attr(,"levels")
#[1] "a" "b"

storage.mode(gl(2, 2, labels = letters[1:2]))
#[1] "integer"

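With an explicit NA level, as.factor and factor(f, exclude = NULL) keep that level, while factor(f) with the default exclude = NA drops it:
f <- factor(c(1, NA), exclude = NULL)
f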
#[1] 1    <NA>
#Levels: 1 <NA>

as.factor(f)
#[1] 1    <NA>
#Levels: 1 <NA>

factor(f, exclude = NULL)
#[1] 1    <NA>
#Levels: 1 <NA>

factor(f)
#[1] 1    <NA>
#Levels: 1

When the input is already a factor, split and tapply keep its unused levels:
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

split(x, f)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#numeric(0)

tapply(x, f, FUN = mean)
# a  b  c 
# 1  2 NA 
