R: find the most frequent color and its count in each row

I have a matrix in the following format:

     [,1]     [,2]  [,3]    [,4]   [,5]   [,6]  [,7]    [,8]   [,9]  
[1,] "blue"   "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[2,] "green"  "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[3,] "yellow" "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[4,] "red"    "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[5,] "blue"   "red" "green" "blue" "blue" "red" "green" "blue" "blue"
[6,] "green"  "red" "green" "blue" "blue" "red" "green" "blue" "blue"
 ...

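How can I quickly compute the most frequent color, and its count, for each row?

For example, for row 1 the result would be "blue, 6". I currently do this with an apply call that runs table on each row.

However, my matrix has 1.9 million rows, so this takes too long. How can I vectorize it?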
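The asker's code is not shown, so the following is only a guess at what the apply/table baseline might look like. The matrix values and the helper name top_of_row are made up for illustration.

```r
# Small stand-in for the real matrix (values invented for this sketch).
mat <- matrix(c("blue",   "red", "blue", "blue",
                "green",  "red", "red",  "blue",
                "yellow", "red", "red",  "red"),
              nrow = 3, byrow = TRUE)

# Hypothetical helper: tabulate one row and report its most frequent value.
top_of_row <- function(x) {
  tab <- table(x)
  # which.max() returns the first maximum; table() orders names alphabetically
  paste(names(tab)[which.max(tab)], max(tab), sep = ", ")
}

# One table() call per row: simple, but slow when there are millions of rows.
baseline <- apply(mat, 1, top_of_row)
baseline
# [1] "blue, 3" "red, 2"  "red, 3"
```

Each row triggers a separate table() call, which is probably where most of the time goes on a matrix with 1.9 million rows.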

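How many different values can each cell of the matrix take? Is it the same set as in your example? If so, something like the following may be faster: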

dat <- structure(c("blue", "green", "yellow", "red", "blue", "green", 
    "red", "red", "red", "red", "red", "red", "red", "red", "blue", 
    "blue", "blue", "blue", "green", "green", "red", "blue", "blue", 
    "blue", "blue", "blue", "blue", "red", "blue", "blue", "blue", 
    "blue", "blue", "blue", "blue", "red", "red", "red", "red", "red", 
    "red", "blue", "green", "green", "green", "green", "green", "green", 
    "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue", 
    "blue", "blue", "blue", "blue", "blue", "blue", "green"), .Dim = c(7L, 
    9L))

values <- c("blue", "red", "green", "yellow")
counts <- vapply(values, function(value) rowSums(dat == value), 
    numeric(nrow(dat))) # Thanks to @RichardScriven for the improvement :)
counts 
#      blue red green yellow
# [1,]    6   2     1      0
# [2,]    5   2     2      0
# [3,]    5   2     1      1
# [4,]    5   3     1      0
# [5,]    5   2     2      0
# [6,]    4   2     3      0
# [7,]    4   4     1      0

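# Column index of the largest count in each row (ties are broken at random by default)
max.value.col <- max.col(counts)
max.value <- colnames(counts)[max.value.col]
# Matrix indexing: pick counts[i, max.value.col[i]] for every row i
max.counts <- counts[cbind(seq_len(nrow(counts)), max.value.col)]
paste(max.value, max.counts, sep = ", ")
# [1] "blue, 6" "blue, 5" "blue, 5" "blue, 5" "blue, 5" "blue, 4"
# [7] "blue, 4" or "red, 4" (row 7 is a tie between blue and red)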

I think this is a practical data.table solution. It uses data.table's fast .N to count the frequencies in each row.

library(data.table)

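# mat is the character matrix from the question. After transposing,
# each original row becomes one column (V1, V2, ...) of flip.
flip <- data.table(t(mat))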

tally <- lapply(names(flip), 
                function(x) {
                  setnames(flip[, .N, by=eval(x)][order(-N)][1,],
                           c('clr', 'N')) } )
do.call(rbind, tally)

#     clr N
# 1: blue 6
# 2: blue 5
# 3: blue 5
# 4: blue 5
# 5: blue 5
# 6: blue 4
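If transposing a large matrix is a concern, one possible variation is to reshape the matrix into a long table with an explicit row id and count per (id, color) pair. This is only a sketch and not the answerer's method. The small matrix and the column names id and clr are assumptions made for illustration, and the data.table package must be installed.

```r
library(data.table)

# Small stand-in matrix (values invented for this sketch).
mat <- matrix(c("blue",   "red", "blue", "blue",
                "green",  "red", "red",  "blue",
                "yellow", "red", "red",  "red"),
              nrow = 3, byrow = TRUE)

# Long format: one line per cell, tagged with the matrix row it came from.
long <- data.table(id = as.vector(row(mat)), clr = as.vector(mat))

# Frequency of every color within every row.
tally <- long[, .N, by = .(id, clr)]

# Sort so the largest count comes first within each id, then keep that line.
setorder(tally, id, -N)
top <- unique(tally, by = "id")
top
```

For rows with tied counts, this keeps whichever tied color happens to sort first, so the tie handling is arbitrary here as well.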

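Can you show the code you are currently using, for comparison?

How long is "too long"? How fast do you need it to be? If you can't answer that, then I don't think you can say how long "too long" is.

Someone has posted a solution, though, that speeds things up a great deal: the code typically ran in around 40 seconds. This solution takes only about a second, which is perfect :-).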
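To put numbers on "too long", the two approaches can be timed on simulated data with system.time. This is a rough sketch: the row count, the seed, and the function names slow and fast are arbitrary choices, and the timings will differ by machine. Ties are broken the same way in both versions (first maximum in alphabetical order) so that the results can be compared directly.

```r
set.seed(1)
values <- c("blue", "green", "red", "yellow")   # alphabetical, to match table()
n <- 5000                                       # far fewer rows than the real data
mat <- matrix(sample(values, n * 9, replace = TRUE), ncol = 9)

# Row-by-row version: one table() call per row.
slow <- function(m) {
  apply(m, 1, function(x) {
    tab <- table(x)
    names(tab)[which.max(tab)]
  })
}

# Vectorized version: one rowSums() pass per possible value.
fast <- function(m) {
  counts <- vapply(values, function(v) rowSums(m == v), numeric(nrow(m)))
  colnames(counts)[max.col(counts, ties.method = "first")]
}

t_slow <- system.time(res_slow <- slow(mat))
t_fast <- system.time(res_fast <- fast(mat))

identical(res_slow, res_fast)   # both versions should agree
rbind(slow = t_slow, fast = t_fast)[, "elapsed"]
```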

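vapply(values, function(value) rowSums(dat == value), numeric(nrow(dat))) may be even faster than sapply.

@konvas If there are ties among the maximum counts, max.col seems to pick one of them arbitrarily. Is there an equivalent of max.col that finds all of the maxima?

@RichardScriven Good point! That will speed things up a lot.

@Khashaa Have a look at ?max.col. You can adjust the ties.method argument, but only three options are available: "random", "first" and "last". What exactly did you have in mind?

It would just be nice if it had ties.method = "all".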
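max.col has no option that returns every tied column, but something similar can be put together from a logical comparison against each row's maximum. The following is a sketch with a made-up counts matrix; the names row_max, is_max, hits and all_max are illustrative.

```r
# Made-up counts: row 2 has a two-way tie, row 3 a three-way tie.
counts <- rbind(c(6, 2, 1, 0),
                c(4, 4, 1, 0),
                c(3, 3, 3, 0))
colnames(counts) <- c("blue", "red", "green", "yellow")

# Largest count in each row, taken via max.col() and matrix indexing.
idx <- max.col(counts, ties.method = "first")
row_max <- counts[cbind(seq_len(nrow(counts)), idx)]

# row_max has one entry per row, so it is recycled down each column and
# every cell is compared with the maximum of its own row.
is_max <- counts == row_max

# Collect the names of all columns that reach the maximum, grouped by row.
hits <- which(is_max, arr.ind = TRUE)
all_max <- split(colnames(counts)[hits[, "col"]], hits[, "row"])
all_max[["2"]]
# [1] "blue" "red"
```

The result is a list with one character vector per row, since the number of tied colors can differ from row to row.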

Eh, but the t() operation is slow on large matrices. Never mind.