R: Select rows based on rank within a column

I have an R data frame like the one below:

name score
marry 98
marry 77
marry 87
marry 96
mark 99
mark 44
mark 79
john 87
john 77
For each name, I want to select the rows with the 2 highest scores, so the result should be:

name score
marry 98
marry 96
mark 99
mark 79
john 87
john 77
Can anyone help? Thanks a lot.
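
To make the question reproducible, the sample data can be entered as a data frame. This is a minimal sketch: the object name df is taken from the answers that follow, and stringsAsFactors = FALSE is an assumption made here so that name stays a character column.

```r
# Sample data from the question; `df` is the name the answers use.
df <- data.frame(
  name  = c("marry", "marry", "marry", "marry",
            "mark", "mark", "mark", "john", "john"),
  score = c(98, 77, 87, 96, 99, 44, 79, 87, 77),
  stringsAsFactors = FALSE  # assumption: keep `name` as character
)
str(df)
```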

You can try:

 devtools::install_github("hadley/dplyr")
 library(dplyr)


 df %>% 
      group_by(name) %>% 
      arrange(desc(score)) %>%
       slice(1:2)

 #     name score
 #1  john    87
 #2  john    77
 #3  mark    99
 #4  mark    79
 #5 marry    98
 #6 marry    96
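
In more recent dplyr releases (1.0.0 and later, available from CRAN), the same selection can probably be written more directly with slice_max(). This is a sketch rather than the answerer's code; df is rebuilt from the question's data so the block runs on its own, and top2 is just an illustrative name.

```r
library(dplyr)

# Question's sample data, rebuilt so this block is self-contained.
df <- data.frame(
  name  = c("marry", "marry", "marry", "marry",
            "mark", "mark", "mark", "john", "john"),
  score = c(98, 77, 87, 96, 99, 44, 79, 87, 77),
  stringsAsFactors = FALSE
)

top2 <- df %>%
  group_by(name) %>%
  # keep at most two rows per name, highest scores first;
  # with_ties = FALSE stops ties from returning extra rows
  slice_max(score, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

Setting with_ties = FALSE matters when several rows share the second-highest score; the default would keep all of them.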
Or using data.table:

 library(data.table)
 setDT(df)[order(-score), .SD[1:2], by=name]
 #      name score
 #1:  mark    99
 #2:  mark    79
 #3: marry    98
 #4: marry    96
 #5:  john    87
 #6:  john    77
Data
  • Functions

      aMahto <- function(mydf) {mydf[with(mydf, 
                 ave(-score, name, FUN = order)) %in% c(1, 2), ]
               }
    
      akrun1 <- function(mydf) {setDT(mydf)[order(-score), .SD[1:2], by=name] }
      akrun2 <- function(mydf) {setDT(mydf)[order(-score), head(.SD,2), by=name] }
      dArenburg <- function(mydf){ setorder(setDT(mydf), -score)[,
                                                head(.SD,2), by=name]}
      akrun3 <- function(mydf) { mydf %>% group_by(name) %>% 
                                   arrange(desc(score)) %>% slice(1:2) }
    
    
      rScriven1 <- function(mydf) {sapply(split(mydf$score, mydf$name),
                                           function(x) tail(sort(x), 2))}
      rScriven2 <- function(mydf) {stack(lapply(split(mydf$score, mydf$name),
                                            function(x) tail(sort(x), 2)))}
    
  • On the larger data set, @David Arenburg's method is the winner, with a median of about 1.6 s against 3.7 s or more for the others:

        microbenchmark(aMahto(df2), akrun1(df2), akrun2(df2), akrun3(df2), 
                     dArenburg(df2), rScriven1(df2), rScriven2(df2), times=40L)
        Unit: seconds
                   expr       min        lq      mean    median        uq       max neval
            aMahto(df2) 11.830111 12.027325 12.273881 12.213140 12.533628 13.196659    40
            akrun1(df2)  6.672874  6.890442  7.018749  6.956716  7.128060  7.542047    40
            akrun2(df2)  3.794502  3.829567  3.860565  3.847690  3.869065  4.143381    40
            akrun3(df2)  3.687974  3.725867  3.801861  3.743973  3.933935  4.102295    40
         dArenburg(df2)  1.531356  1.598570  1.647648  1.618573  1.640258  2.716042    40
         rScriven1(df2)  6.370144  6.573998  6.685313  6.616246  6.820830  7.118827    40
         rScriven2(df2)  6.551911  6.628134  6.743644  6.724310  6.867090  7.091750    40
    

    This gives a differently shaped output, but this way the names are not repeated:

    sapply(split(df$score, df$name), function(x) tail(sort(x), 2))
    #      john mark marry
    # [1,]   77   79    96
    # [2,]   87   99    98
    
    As Ananda Mahto suggested, you can also use stack with lapply:

    stack(lapply(split(df$score, df$name), function(x) tail(sort(x), 2)))
    #   values   ind
    # 1     77  john
    # 2     87  john
    # 3     79  mark
    # 4     99  mark
    # 5     96 marry
    # 6     98 marry
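
    Because split(df$score, df$name) carries only the score column, any other columns are lost. If whole rows are needed, one base R sketch is to split the data frame itself. The helper name top2_rows is illustrative, and df is rebuilt from the question's data so the block runs on its own.

```r
# Question's sample data, rebuilt so this block is self-contained.
df <- data.frame(
  name  = c("marry", "marry", "marry", "marry",
            "mark", "mark", "mark", "john", "john"),
  score = c(98, 77, 87, 96, 99, 44, 79, 87, 77),
  stringsAsFactors = FALSE
)

# Split the whole data frame by name, sort each piece by descending
# score, keep at most two rows, then bind the pieces back together.
top2_rows <- function(mydf) {
  pieces <- lapply(split(mydf, mydf$name), function(d) {
    head(d[order(-d$score), ], 2)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL  # drop the "john.8"-style row names rbind creates
  out
}

res <- top2_rows(df)
res
```

    Unlike the tail(sort(x), 2) versions, this returns each group's scores in descending order and keeps every column of the input.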
    

    Here is one possible base R approach:

    mydf[with(mydf, ave(-score, name, FUN = order)) %in% c(1, 2), ]
    #    name score
    # 1 marry    98
    # 4 marry    96
    # 5  mark    99
    # 7  mark    79
    # 8  john    87
    # 9  john    77
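
    One caveat worth checking: order() returns the permutation that sorts a vector, not the rank of each element. The two happen to coincide for the sample data, but they can differ in general. The sketch below uses a made-up three-row group (the toy data frame and the helper name top2_rank are illustrative, not part of the original answer) and swaps in rank() with ties.method = "first".

```r
# Hypothetical group where order() and rank() disagree.
toy <- data.frame(name = "a", score = c(10, 30, 20))

order(-toy$score)                        # positions of the sorted values
rank(-toy$score, ties.method = "first")  # rank of each element

# order()-based filter from the answer above, applied to the toy data
by_order <- toy[with(toy, ave(-score, name, FUN = order)) %in% c(1, 2), ]

# Rank-based variant (illustrative helper, not from the original answer)
top2_rank <- function(mydf) {
  r <- with(mydf, ave(-score, name,
                      FUN = function(x) rank(x, ties.method = "first")))
  mydf[r <= 2, ]
}
by_rank <- top2_rank(toy)

by_order$score  # 10 20: the lowest score slips in
by_rank$score   # 30 20: the two highest scores
```

    On the question's data both filters return the same six rows, so the printed output above is unaffected.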
    

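    For the curious, regarding timings, here is a small test.

    Two sample data sets, both with 1M rows and two columns; in one of them "name" has only 1,000 possible values, in the other 10,000: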

    set.seed(1)
    df1 <- data.frame(
      name = sample(1000, 1000000, TRUE),
      score = sample(0:100, 1000000, TRUE)
    )
    df2 <- data.frame(
      name = sample(10000, 1000000, TRUE),
      score = sample(0:100, 1000000, TRUE)
    )
    

    The functions under test and the timings:

    library(data.table)
    library(microbenchmark)

    fun1 <- function(mydf) {
      mydf[with(mydf, ave(-score, name, FUN = order)) %in% c(1, 2), ]
    }

    fun2 <- function(mydf) {
      as.data.table(mydf)[order(-score), .SD[1:2], by=name]
    }

    fun3 <- function(mydf) {
      df <- as.data.table(mydf)
      setorder(df, -score)[, head(.SD, 2), by = name]
    }

    microbenchmark(fun1(df1), fun2(df1), fun3(df1),
                   fun1(df2), fun2(df2), fun3(df2), times = 20)
    # Unit: milliseconds
    #       expr        min         lq       mean     median         uq       max neval
    #  fun1(df1)  502.76809  513.98317  569.47883  597.90488  603.34458  686.4302    20
    #  fun2(df1)  733.12544  741.18777  796.67106  822.60824  828.88449  839.3837    20
    #  fun3(df1)   87.80581   93.07012   95.34281   95.56374   97.49608  101.7991    20
    #  fun1(df2)  672.60241  764.10237  764.60365  772.33959  780.14679  799.3505    20
    #  fun2(df2) 6338.14881 6360.42621 6407.66675 6412.99278 6451.75626 6479.2681    20
    #  fun3(df2)  354.24119  366.47396  382.58666  369.78597  374.01897  468.9197    20


    Here is another data.table approach, using the new setorder function (which sorts by reference):

    library(data.table) # 1.9.4+
    setorder(setDT(df), -score)[, head(.SD, 2), by = name]
    #     name score
    # 1:  mark    99
    # 2:  mark    79
    # 3: marry    98
    # 4: marry    96
    # 5:  john    87
    # 6:  john    77


    +1 for the "data.table" approach. I posted the base R equivalent at the same time you posted this, I guess :-)

    +1 Just curious, do you know whether this is faster than the one I posted?

    @akrun, sure, setorder should be more efficient. Unless [...] beats head, which would be an interesting check.

    Thanks for setorder.

    @akrun, no offense, but my ave solution is faster than the "data.table" one-liner, at least up to ~1M rows, but David's is consistently the fastest.

    @akrun, it may also depend on the number of unique "name" values and so on. setorder is the way to go here. David, +1.

    +1, but you might want to consider lapply instead of sapply. It is faster, and it lets you do:
    stack(lapply(split(df$score, df$name), function(x) tail(sort(x), 2)))
    It is not as fast as unlist, but it should be faster than simplify2array.

    (+1), but I don't like your as.data.table. I think setDT is more efficient (but I may be wrong, since df did not exist before). I also think you should name the functions after their authors, so it is clear who the winner is ;)

    @David Arenburg, I don't think it makes a difference. It is either that or using copy, since what we are really interested in testing is how the sorting is handled (unless I am approaching the benchmark incorrectly).

    OK, at least make it a one-liner, something like:
    setorder(as.data.table(mydf), -score)[, head(.SD, 2), by = name]

    @Ananda Mahto I updated the benchmarks to include dplyr.