R: find the most frequent color and its count in each row
I have a matrix in the following format:
     [,1]     [,2]  [,3]    [,4]   [,5]   [,6]  [,7]    [,8]   [,9]
[1,] "blue"   "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[2,] "green"  "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[3,] "yellow" "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[4,] "red"    "red" "blue"  "blue" "blue" "red" "green" "blue" "blue"
[5,] "blue"   "red" "green" "blue" "blue" "red" "green" "blue" "blue"
[6,] "green"  "red" "green" "blue" "blue" "red" "green" "blue" "blue"
...
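How can I quickly compute the most frequent color and its count for each row?
For example, for row 1 the result would be "blue, 6". I currently do this with an apply command that calls table.
However, my matrix has 1.9 million rows, so this takes too long. How can I vectorize it?

How many distinct values can each cell of the matrix take? Are they the same as in your example? If so, something like the following might be faster: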
# Example data: a 7 x 9 character matrix of color names
dat <- structure(c("blue", "green", "yellow", "red", "blue", "green",
    "red", "red", "red", "red", "red", "red", "red", "red", "blue",
    "blue", "blue", "blue", "green", "green", "red", "blue", "blue",
    "blue", "blue", "blue", "blue", "red", "blue", "blue", "blue",
    "blue", "blue", "blue", "blue", "red", "red", "red", "red", "red",
    "red", "blue", "green", "green", "green", "green", "green", "green",
    "blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue",
    "blue", "blue", "blue", "blue", "blue", "blue", "green"), .Dim = c(7L,
    9L))
# Every value a cell can take
values <- c("blue", "red", "green", "yellow")
# One vectorized rowSums() per value gives an nrow(dat) x length(values) matrix
counts <- vapply(values, function(value) rowSums(dat == value),
                 numeric(nrow(dat))) # Thanks to @RichardScriven for the improvement :)
counts
#      blue red green yellow
# [1,]    6   2     1      0
# [2,]    5   2     2      0
# [3,]    5   2     1      1
# [4,]    5   3     1      0
# [5,]    5   2     2      0
# [6,]    4   2     3      0
# [7,]    4   4     1      0
# Column index of the largest count in each row. The default ties.method is
# "random", so "first" is used here to make the tie in row 7 reproducible.
max.value.col <- max.col(counts, ties.method = "first")
max.value <- colnames(counts)[max.value.col]
max.counts <- counts[cbind(seq_len(nrow(counts)), max.value.col)]
paste(max.value, max.counts, sep = ", ")
# [1] "blue, 6" "blue, 5" "blue, 5" "blue, 5" "blue, 5" "blue, 4" "blue, 4"
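If the set of possible values is not known in advance, it can be derived from the data instead of being typed in by hand. The following is a minimal sketch under that assumption. `row_mode` is a hypothetical helper name that does not appear in the answer above, and ties are resolved in favor of the value that sorts first.

```r
# Hypothetical helper (not from the original answer): most frequent value
# per row of a character matrix, together with its count.
row_mode <- function(m) {
  # Derive the candidate values from the data itself
  values <- sort(unique(as.vector(m)))
  # One vectorized rowSums() per distinct value
  counts <- vapply(values, function(v) rowSums(m == v, na.rm = TRUE),
                   numeric(nrow(m)))
  # Force a matrix shape so a one-row input also works
  counts <- matrix(counts, nrow = nrow(m), dimnames = list(NULL, values))
  idx <- max.col(counts, ties.method = "first")
  data.frame(value = values[idx],
             count = counts[cbind(seq_len(nrow(m)), idx)],
             stringsAsFactors = FALSE)
}

# Small illustrative matrix (made up for this example)
m <- matrix(c("blue", "red", "blue",
              "red",  "red", "green"),
            nrow = 2, byrow = TRUE)
res <- row_mode(m)
res
```

The cost still grows with the number of distinct values, so this only pays off when that number is small compared with the number of rows.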
Here is what I think is a practical data.table solution. It uses data.table's fast .N to compute the frequencies within each row.
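library(data.table)
# `mat` is the character matrix from the question.
# Transpose it so that each original row becomes a column (V1, V2, ...).
flip <- data.table(t(mat))
# For each column, count the occurrences of every color with .N,
# sort by descending count and keep the top row.
tally <- lapply(names(flip),
                function(x) {
                  setnames(flip[, .N, by = eval(x)][order(-N)][1, ],
                           c('clr', 'N'))
                })
rbindlist(tally)
#     clr N
# 1: blue 6
# 2: blue 5
# 3: blue 5
# 4: blue 5
# 5: blue 5
# 6: blue 4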
Can you show the code you are currently using, for comparison?
How long is "too long"? How fast do you need it to be? If you can't answer that, I don't think you can say how long "too long" is.
Someone has posted a solution that speeds it up a great deal, though. My code normally ran in around 40 seconds. This solution takes about one second, which is perfect :-).
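The question does not show the original code. A row-wise version along the lines it describes (an `apply` call that runs `table` on each row) might look like the sketch below. This is a guess at the approach rather than the asker's actual code, and `row_mode_slow` is a hypothetical name.

```r
# Hypothetical reconstruction of the slow, row-by-row approach.
row_mode_slow <- function(m) {
  res <- apply(m, 1, function(r) {
    tab <- table(r)               # frequency table for this row
    top <- which.max(tab)         # first maximum when counts are tied
    c(names(tab)[top], tab[[top]])
  })
  res <- matrix(res, nrow = 2)    # keep a 2 x n shape even for one row
  data.frame(value = res[1, ],
             count = as.integer(res[2, ]),
             stringsAsFactors = FALSE)
}

# Small illustrative matrix (made up for this example)
m <- matrix(c("blue", "red", "blue",
              "red",  "red", "green"),
            nrow = 2, byrow = TRUE)
slow <- row_mode_slow(m)
slow
```

Here `table()` is called once per row, which is the per-row overhead that the vectorized answers avoid.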
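vapply(values, function(value) rowSums(dat == value), numeric(nrow(dat))) might be even faster than sapply.
@konvas If there are ties among the maximum counts, max.col seems to pick one of them arbitrarily. Is there an equivalent of max.col that finds all the maxima?
@RichardScriven Good point! That will speed things up considerably.
@Khashaa Have a look at ?max.col. You can adjust the ties.method argument, but only three options are available: "random", "first" and "last". What exactly did you have in mind?
It would just be nice if it had ties.method = "all".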
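Since max.col offers no "all" option, one way to recover every tied maximum is to compare each count with the maximum of its row. This is only a sketch: `all_row_maxima` is a hypothetical helper, and it assumes a numeric count matrix with column names, like `counts` in the answer above.

```r
# Hypothetical helper: for each row of a count matrix, return the names
# of all columns that attain the row maximum.
all_row_maxima <- function(counts) {
  n <- nrow(counts)
  # Row maxima without a row-wise loop: look up the first maximum per row
  row.max <- counts[cbind(seq_len(n), max.col(counts, ties.method = "first"))]
  # row.max is recycled down the columns, so entry [i, j] is compared
  # with row.max[i]
  is.max <- counts == row.max
  hits <- which(is.max, arr.ind = TRUE)
  split(colnames(counts)[hits[, "col"]], hits[, "row"])
}

# Illustrative counts: row 2 has a tie between blue and red
counts <- rbind(c(6, 2, 1, 0),
                c(4, 4, 1, 0))
colnames(counts) <- c("blue", "red", "green", "yellow")
ties <- all_row_maxima(counts)
ties
```

The result is a list with one element per row, because the number of tied values can differ from row to row.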
Eh, but the t() operation is slow on large matrices. Never mind.