Summarizing the distribution of factors in an R data frame

Suppose I have a data.frame like this:

  X1   X2   X3
1 A    B    A
2 A    C    B
3 B    A    B
4 A    A    C
I would like to count the occurrences of A, B, C, etc. in each column and return the result as

    A_count B_count C_count
X1  3       1       0       
X2  2       1       1
X3  1       2       1
I'm sure this question has a thousand duplicates, but I can't seem to find an answer that works for me :(

By running

apply(mydata, 2, table)
I get

$X1
   B     A
   1     3
$X2
   A     C     B
   2     1     1
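But that is not quite what I want, and if I try to build it back into a data frame it does not work, because I do not get the same number of columns for each row (as with $X1 above, which has no C).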

What am I missing?

Many thanks!

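You can re-factor each column so that all columns share the common factor levels, and then tabulate. I would also suggest using lapply() rather than apply(), because apply() is meant for matrices.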

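# stringsAsFactors = TRUE is needed on R >= 4.0.0, where read.table() no longer
# converts strings to factors by default; levels() below relies on factors
df <- read.table(text = "X1   X2   X3
1 A    B    A
2 A    C    B
3 B    A    B
4 A    A    C", header = TRUE, stringsAsFactors = TRUE)

# Re-factor each column with the levels shared by all columns, tabulate it,
# then stack the per-column tables row-wise
do.call(
    rbind, 
    lapply(df, function(x) table(factor(x, levels=levels(unlist(df)))))
)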
#    A B C
# X1 3 1 0
# X2 2 1 1
# X3 1 2 1
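
To see why the shared levels matter, here is a minimal sketch. The vector `v` and the level set `lv` are made up for illustration and are not part of the answer above. `table()` on a factor reports every declared level, including levels that never occur, so each column produces a count vector of the same length and `rbind` can line them up.

```r
# Hypothetical column that contains no "C" values
v  <- c("A", "A", "B", "A")
lv <- c("A", "B", "C")   # assumed common level set across all columns

# Without shared levels, only the values actually present are counted
table(v)

# With shared levels, the absent level "C" is reported with a count of 0
counts <- table(factor(v, levels = lv))
counts
```

This is the step that removes the ragged result from the question: every per-column table now has one entry per level.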

Assuming your data frame is x, I would simply do the following:

do.call(rbind, tapply(unlist(x, use.names = FALSE),
                      rep(1:ncol(x), each = nrow(x)),
                      table))

#  A B C
#1 3 1 0
#2 2 1 1
#3 1 2 1
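
The result above is labelled by group index (1, 2, 3) rather than by column name. A possible way to restore the names is sketched below. It assumes the columns are factors (hence `stringsAsFactors = TRUE`) and rebuilds the example data inline so that the snippet is self-contained. `seq_len(ncol(x))` is used in place of `1:ncol(x)` purely as a defensive habit.

```r
# Example data from the question, stored as factors
x <- data.frame(X1 = c("A", "A", "B", "A"),
                X2 = c("B", "C", "A", "A"),
                X3 = c("A", "B", "B", "C"),
                stringsAsFactors = TRUE)

# unlist() merges the factor columns into one factor whose levels are the
# union of all column levels; the index vector marks which column each
# value came from, so tapply() tabulates column by column
res <- do.call(rbind, tapply(unlist(x, use.names = FALSE),
                             rep(seq_len(ncol(x)), each = nrow(x)),
                             table))

# Groups come back in column order, so the column names can be reattached
rownames(res) <- names(x)
res
```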

Benchmark

# a function to generate toy data
# `k` factor levels
# `n` rows
# `p` columns
datsim <- function(n, p, k) {
  as.data.frame(replicate(p, sample(LETTERS[1:k], n, TRUE), simplify = FALSE),
                col.names = paste0("X",1:p), stringsAsFactors = TRUE)
  }

# try `n = 100`, `p = 500` and `k = 3`
x <- datsim(100, 500, 3)

## DirtySockSniffer's answer
system.time(do.call(rbind, lapply(x, function(u) table(factor(u, levels=levels(unlist(x)))))))
#   user  system elapsed 
# 21.240   0.068  21.365 

## my answer
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
#   user  system elapsed 
#  0.108   0.000   0.111 
An improved version of Dirty's answer computes the common levels once, instead of once per column:

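## my answer (same call as above; the slower timing suggests a larger data
## set, which the post does not show)
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
#   user  system elapsed 
#  1.844   0.056   1.904 

## improved Dirty's answer: the common levels are cached in `clevels`
system.time({clevels <- levels(unlist(x, use.names = FALSE));
             do.call(rbind, lapply(x, function(u) table(factor(u, levels=clevels))))})
#   user  system elapsed 
#  1.240   0.012   1.263 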
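
Before comparing timings, it may be worth confirming that the two approaches return the same counts. The following is a small sketch of such a check. The seed and the data size are arbitrary choices for illustration, and `datsim` is repeated from above so that the snippet runs on its own.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Toy data generator, as defined earlier in the answer
datsim <- function(n, p, k) {
  as.data.frame(replicate(p, sample(LETTERS[1:k], n, TRUE), simplify = FALSE),
                col.names = paste0("X", 1:p), stringsAsFactors = TRUE)
}

x <- datsim(20, 5, 3)  # small size chosen only for a quick check

# tapply() approach: one combined factor, split by column index
a <- do.call(rbind, tapply(unlist(x, use.names = FALSE),
                           rep(seq_len(ncol(x)), each = nrow(x)),
                           table))

# lapply() approach with the common levels computed once
clevels <- levels(unlist(x, use.names = FALSE))
b <- do.call(rbind, lapply(x, function(u) table(factor(u, levels = clevels))))

# The row labels differ (1..p versus X1..Xp) but the counts should agree
same_counts <- all(a == b)
same_counts
```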

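Hi Zheyuan, not important, but on my laptop levels(u)[u] is a bit slower than as.character (which I think makes sense, as I'm sure the R folks have optimised it). For the second example it does seem faster, since as.numeric is called on the smaller vector rather than on the full vector. So if conversion to numeric is needed, it does look faster, as you say.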