R 如何删除每列最大值为2的行,并使左侧数据帧尽可能长?

R 如何删除每列最大值为2的行,并使左侧数据帧尽可能长?,r,R,我想删除每列的max2值(异常值),然后分析左数据帧 > data.frame(q1 = c(2, 4, 5,8,8), q2 = c(1, 6, 3,8,5), q3 = c(5, 3, 6,5,2)) q1 q2 q3 1 2 1 5 2 4 6 3 3 5 3 6 4 8 8 5 5 8 5 2 q1中的最大2值:8,8,然后应删除第5,4行 q2中的最大2值:8,6,然后应删除第4,2行 应删除q3中的最大值2:6,5,然后删除第3,4行(不是1

我想删除每列的max
2
值(异常值),然后分析左数据帧

> data.frame(q1 = c(2, 4, 5,8,8), q2 = c(1, 6, 3,8,5), q3 = c(5, 3, 6,5,2))
  q1 q2 q3
1  2  1  5
2  4  6  3
3  5  3  6
4  8  8  5
5  8  5  2
q1
中的最大2值:8,8,然后应删除第5,4行
q2
中的最大2值:8,6,然后应删除第4,2行
应删除
q3
中的最大值2:6,5,然后删除第3,4行(不是1,以尽可能长地保留左侧数据帧,这意味着删除行的时间越短越好)

预期结果如下:

  q1 q2 q3
1  2  1  5

如何操作?

每个向量列中的两个或
n
最大值通过辅助函数
max2vals
找到。函数将
lappy
ed添加到数据帧,并删除相应的行

max2vals <- function(x, n = 2){
  tail(order(x), n)
}


df1 <- data.frame(q1 = c(2, 4, 5,8,8), 
                  q2 = c(1, 6, 3,8,5), 
                  q3 = c(5, 3, 6,5,2))

i <- unique(unlist(lapply(df1, max2vals)))
df1[-i,, drop = FALSE]
#  q1 q2 q3
#1  2  1  5

max2vals我们可以使用
这里有一个答案,我认为这个问题值得回答,但可能不是它需要的答案(蝙蝠侠?!)

这是一种混合整数规划(MIP)方法,使用
ompr
进行数学建模,使用
glpk
作为求解器。我已经在代码注释中包含了逐步逻辑

请注意,
ompr
gplk
对于
2000 x 50
数据集来说都太慢了@对于同一个数据集,jay.sf解决方案花费了5分钟,但当我试图删除每列前3名时,它的内存很容易耗尽,因此确实有其局限性

我的建议-如果最优性很重要,请研究MIP选项(也可以尝试python),否则@jay.sf带有小
n
的解决方案就足够了

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

remove_rows <- function(df, n = 2) {
  # mark top n values in every column
  df_logical <- df %>% 
    mutate_all(~. %in% sort(., decreasing = T)[1:n])

  # marks which rows are removable i.e any row with at least one 1 in it
  xij_bounds <- as.matrix(unname(+df_logical))

  a <- nrow(df) # number of rows
  b <- ncol(df) # number of columns

  MIPModel() %>%
    # x[i,j] is 1 when cell i,j is selected for removal else 0
    add_variable(x[i,j], i = 1:a, j = 1:b, type = "binary") %>%
    # y[i] is 1 when row i is selected for removal else 0
    add_variable(y[i], i = 1:a, type = "binary") %>% 
    # objective is minimize number of removed rows
    set_objective(sum_expr(y[i], i = 1:a), "min") %>%
    # y[i] = 1 when any x[i,j] = 1 for any i
    add_constraint(y[i] - x[i,j] >= 0, i = 1:a, j = 1:b) %>% 
    # at least remove n values from every column j
    add_constraint(sum_expr(x[i,j], i = 1:a) >= n, j = 1:b) %>%
    # x[i,j] can be 1 only when cell i,j belongs to top n values of column j
    add_constraint(x[i,j] - xij_bounds[i,j] <= 0, i = 1:a, j = 1:b) %>% 
    # solve model
    solve_model(with_ROI(solver = "glpk")) %>% 
    # get indices of rows to remove i.e. where y[i] = 1
    get_solution(y[i]) %>% 
    filter(value > 0) %>% 
    pull(i)
}
例2- 例3-
#根据您的评论,我尝试了以下dim 2000x50的数据帧
种子(2)

df3请注意,“保持左数据帧尽可能长”使这成为一个优化问题,对于这个问题,随着行数和列数的增加,简单的方法无法保证最佳解决方案。@Shree,是的,这就是为什么我在这里问您的实际数据帧的大小是多少?@Shree,50列,2000 rowsIt在这里只是一个共同关联,
order
返回与OP的预期输出相同的输出。在
df1上试试这个,同意@RonakShah。然而,我不认为会有任何更简单的解决方案,因为根据实际数据帧的大小,可能要检查的组合将快速增长。贪婪的方法可能相当简单,尽管它不能保证最优解。如果我只删除最大值,用
combn(seq(I),1,simplify=FALSE)更改函数
rmfun(d1)
将得到错误
'x'必须是至少两维的数组

d1
#   q1 q2 q3
# 1  2  1  5
# 2  4  6  3
# 3  5  3  6
# 4  8  8  5
# 5  8  5  2

d2
#   q1 q2 q3
# 1  8  8  5
# 2  4  6  3
# 3  5  3  6
# 4  1  8  5
# 5  2  8  2
rmfun(d1)
#   q1 q2 q3
# 1  2  1  5

rmfun(d2)
#   q1 q2 q3
# 2  4  6  3
# 5  2  8  2
d1 <- structure(list(q1 = c(2, 4, 5, 8, 8), q2 = c(1, 6, 3, 8, 5), 
    q3 = c(5, 3, 6, 5, 2)), class = "data.frame", row.names = c(NA, 
-5L))
d2 <- structure(list(q1 = c(8, 4, 5, 1, 2), q2 = c(8, 6, 3, 8, 8), 
    q3 = c(5, 3, 6, 5, 2)), class = "data.frame", row.names = c(NA, 
-5L))
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

remove_rows <- function(df, n = 2) {
  # mark top n values in every column
  df_logical <- df %>% 
    mutate_all(~. %in% sort(., decreasing = T)[1:n])

  # marks which rows are removable i.e any row with at least one 1 in it
  xij_bounds <- as.matrix(unname(+df_logical))

  a <- nrow(df) # number of rows
  b <- ncol(df) # number of columns

  MIPModel() %>%
    # x[i,j] is 1 when cell i,j is selected for removal else 0
    add_variable(x[i,j], i = 1:a, j = 1:b, type = "binary") %>%
    # y[i] is 1 when row i is selected for removal else 0
    add_variable(y[i], i = 1:a, type = "binary") %>% 
    # objective is minimize number of removed rows
    set_objective(sum_expr(y[i], i = 1:a), "min") %>%
    # y[i] = 1 when any x[i,j] = 1 for any i
    add_constraint(y[i] - x[i,j] >= 0, i = 1:a, j = 1:b) %>% 
    # at least remove n values from every column j
    add_constraint(sum_expr(x[i,j], i = 1:a) >= n, j = 1:b) %>%
    # x[i,j] can be 1 only when cell i,j belongs to top n values of column j
    add_constraint(x[i,j] - xij_bounds[i,j] <= 0, i = 1:a, j = 1:b) %>% 
    # solve model
    solve_model(with_ROI(solver = "glpk")) %>% 
    # get indices of rows to remove i.e. where y[i] = 1
    get_solution(y[i]) %>% 
    filter(value > 0) %>% 
    pull(i)
}
df1
  q1 q2 q3
1  2  1  5
2  4  6  3
3  5  3  6
4  8  8  5
5  8  5  2

remove_rows(df1)
[1] 2 3 4 5

df1[-remove_rows(df1), ]
  q1 q2 q3
1  2  1  5
df2
  q1 q2 q3
1  8  8  5
2  4  6  3
3  5  3  6
4  1  8  5
5  2  8  2

remove_rows(df2)
[1] 1 3 4

df2[-remove_rows(df2), ]
  q1 q2 q3
2  4  6  3
5  2  8  2
# Based on your comment I tried with below dataframe of dim 2000x50
set.seed(2)
df3 <- data.frame(replicate(50, sample(2000)))

# Both, ompr modeling and glpk solver, are too slow for my liking

# checking time with @jay.sf solution for top 2
system.time(result <- rmfun(df3, n = 2))
   user  system elapsed 
 298.90    0.01  300.24 

# it runs out of memory for top 3
system.time(result <- rmfun(df3, n = 3))
Error: cannot allocate vector of size 9.9 Gb