R 如何删除每列最大值为2的行,并使左侧数据帧尽可能长?
我想删除每列的maxR 如何删除每列最大值为2的行,并使左侧数据帧尽可能长?,r,R,我想删除每列的max2值(异常值),然后分析左数据帧 > data.frame(q1 = c(2, 4, 5,8,8), q2 = c(1, 6, 3,8,5), q3 = c(5, 3, 6,5,2)) q1 q2 q3 1 2 1 5 2 4 6 3 3 5 3 6 4 8 8 5 5 8 5 2 q1中的最大2值:8,8,然后应删除第5,4行 q2中的最大2值:8,6,然后应删除第4,2行 应删除q3中的最大值2:6,5,然后删除第3,4行(不是1
2
值(异常值),然后分析左数据帧
> data.frame(q1 = c(2, 4, 5,8,8), q2 = c(1, 6, 3,8,5), q3 = c(5, 3, 6,5,2))
q1 q2 q3
1 2 1 5
2 4 6 3
3 5 3 6
4 8 8 5
5 8 5 2
q1
中的最大2值:8,8,然后应删除第5,4行q2
中的最大2值:8,6,然后应删除第4,2行应删除
q3
中的最大值2:6,5,然后删除第3,4行(不是1,以尽可能长地保留左侧数据帧,这意味着删除行的时间越短越好)
预期结果如下:
q1 q2 q3
1 2 1 5
如何操作?每个向量列中的两个或
n
最大值通过辅助函数max2vals
找到。函数将lappy
ed添加到数据帧,并删除相应的行
max2vals <- function(x, n = 2){
tail(order(x), n)
}
df1 <- data.frame(q1 = c(2, 4, 5,8,8),
q2 = c(1, 6, 3,8,5),
q3 = c(5, 3, 6,5,2))
i <- unique(unlist(lapply(df1, max2vals)))
df1[-i,, drop = FALSE]
# q1 q2 q3
#1 2 1 5
max2vals我们可以使用这里有一个答案,我认为这个问题值得回答,但可能不是它需要的答案(蝙蝠侠?!)
这是一种混合整数规划(MIP)方法,使用ompr
进行数学建模,使用glpk
作为求解器。我已经在代码注释中包含了逐步逻辑
请注意,ompr
和gplk
对于2000 x 50
数据集来说都太慢了@对于同一个数据集,jay.sf解决方案花费了5分钟,但当我试图删除每列前3名时,它的内存很容易耗尽,因此确实有其局限性
我的建议-如果最优性很重要,请研究MIP选项(也可以尝试python),否则@jay.sf带有小n
的解决方案就足够了
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
remove_rows <- function(df, n = 2) {
# mark top n values in every column
df_logical <- df %>%
mutate_all(~. %in% sort(., decreasing = T)[1:n])
# marks which rows are removable i.e any row with at least one 1 in it
xij_bounds <- as.matrix(unname(+df_logical))
a <- nrow(df) # number of rows
b <- ncol(df) # number of columns
MIPModel() %>%
# x[i,j] is 1 when cell i,j is selected for removal else 0
add_variable(x[i,j], i = 1:a, j = 1:b, type = "binary") %>%
# y[i] is 1 when row i is selected for removal else 0
add_variable(y[i], i = 1:a, type = "binary") %>%
# objective is minimize number of removed rows
set_objective(sum_expr(y[i], i = 1:a), "min") %>%
# y[i] = 1 when any x[i,j] = 1 for any i
add_constraint(y[i] - x[i,j] >= 0, i = 1:a, j = 1:b) %>%
# at least remove n values from every column j
add_constraint(sum_expr(x[i,j], i = 1:a) >= n, j = 1:b) %>%
# x[i,j] can be 1 only when cell i,j belongs to top n values of column j
add_constraint(x[i,j] - xij_bounds[i,j] <= 0, i = 1:a, j = 1:b) %>%
# solve model
solve_model(with_ROI(solver = "glpk")) %>%
# get indices of rows to remove i.e. where y[i] = 1
get_solution(y[i]) %>%
filter(value > 0) %>%
pull(i)
}
例2-
例3-
#根据您的评论,我尝试了以下dim 2000x50的数据帧
种子(2)
df3请注意,“保持左数据帧尽可能长”使这成为一个优化问题,对于这个问题,随着行数和列数的增加,简单的方法无法保证最佳解决方案。@Shree,是的,这就是为什么我在这里问您的实际数据帧的大小是多少?@Shree,50列,2000 rowsIt在这里只是一个共同关联,order
返回与OP的预期输出相同的输出。在df1上试试这个,同意@RonakShah。然而,我不认为会有任何更简单的解决方案,因为根据实际数据帧的大小,可能要检查的组合将快速增长。贪婪的方法可能相当简单,尽管它不能保证最优解。如果我只删除最大值,用combn(seq(I),1,simplify=FALSE)更改函数,rmfun(d1)
将得到错误'x'必须是至少两维的数组
d1
# q1 q2 q3
# 1 2 1 5
# 2 4 6 3
# 3 5 3 6
# 4 8 8 5
# 5 8 5 2
d2
# q1 q2 q3
# 1 8 8 5
# 2 4 6 3
# 3 5 3 6
# 4 1 8 5
# 5 2 8 2
rmfun(d1)
# q1 q2 q3
# 1 2 1 5
rmfun(d2)
# q1 q2 q3
# 2 4 6 3
# 5 2 8 2
d1 <- structure(list(q1 = c(2, 4, 5, 8, 8), q2 = c(1, 6, 3, 8, 5),
q3 = c(5, 3, 6, 5, 2)), class = "data.frame", row.names = c(NA,
-5L))
d2 <- structure(list(q1 = c(8, 4, 5, 1, 2), q2 = c(8, 6, 3, 8, 8),
q3 = c(5, 3, 6, 5, 2)), class = "data.frame", row.names = c(NA,
-5L))
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
remove_rows <- function(df, n = 2) {
# mark top n values in every column
df_logical <- df %>%
mutate_all(~. %in% sort(., decreasing = T)[1:n])
# marks which rows are removable i.e any row with at least one 1 in it
xij_bounds <- as.matrix(unname(+df_logical))
a <- nrow(df) # number of rows
b <- ncol(df) # number of columns
MIPModel() %>%
# x[i,j] is 1 when cell i,j is selected for removal else 0
add_variable(x[i,j], i = 1:a, j = 1:b, type = "binary") %>%
# y[i] is 1 when row i is selected for removal else 0
add_variable(y[i], i = 1:a, type = "binary") %>%
# objective is minimize number of removed rows
set_objective(sum_expr(y[i], i = 1:a), "min") %>%
# y[i] = 1 when any x[i,j] = 1 for any i
add_constraint(y[i] - x[i,j] >= 0, i = 1:a, j = 1:b) %>%
# at least remove n values from every column j
add_constraint(sum_expr(x[i,j], i = 1:a) >= n, j = 1:b) %>%
# x[i,j] can be 1 only when cell i,j belongs to top n values of column j
add_constraint(x[i,j] - xij_bounds[i,j] <= 0, i = 1:a, j = 1:b) %>%
# solve model
solve_model(with_ROI(solver = "glpk")) %>%
# get indices of rows to remove i.e. where y[i] = 1
get_solution(y[i]) %>%
filter(value > 0) %>%
pull(i)
}
df1
q1 q2 q3
1 2 1 5
2 4 6 3
3 5 3 6
4 8 8 5
5 8 5 2
remove_rows(df1)
[1] 2 3 4 5
df1[-remove_rows(df1), ]
q1 q2 q3
1 2 1 5
df2
q1 q2 q3
1 8 8 5
2 4 6 3
3 5 3 6
4 1 8 5
5 2 8 2
remove_rows(df2)
[1] 1 3 4
df2[-remove_rows(df2), ]
q1 q2 q3
2 4 6 3
5 2 8 2
# Based on your comment I tried with below dataframe of dim 2000x50
set.seed(2)
df3 <- data.frame(replicate(50, sample(2000)))
# Both, ompr modeling and glpk solver, are too slow for my liking
# checking time with @jay.sf solution for top 2
system.time(result <- rmfun(df3, n = 2))
user system elapsed
298.90 0.01 300.24
# it runs out of memory for top 3
system.time(result <- rmfun(df3, n = 3))
Error: cannot allocate vector of size 9.9 Gb