R: Extract the row with the minimum value of a variable, by group

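I want to (1) group data by one variable (`State`), (2) within each group find the row with the minimum value of another variable (`Employees`), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.

Here is a sample data set: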

> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42

data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
        "B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
        104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company", 
        "Employees"), class = "data.frame", row.names = c(NA, -8L))

…or with `data.table`:
> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19

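But how do I extract the entire rows corresponding to these `min` values, i.e. also include `Company` in the result?

Slightly more elegant: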
library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19


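Slightly less elegant than using `.SD`, but a bit faster (for data with many groups), is to index with `.I`, as in the `.I[which.min]` entry of the benchmark below:
DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, if your data set has multiple identical minimum values and you would like to subset all of them, simply replace the expression `which.min(Employees)` with `Employees == min(Employees)`.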
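
To make the difference concrete, here is a small sketch on a made-up table containing a tie. The table `ties` and its values are illustrative assumptions, not part of the question's data. It contrasts `which.min()`, which keeps only the first minimum in each group, with the `==` comparison, which keeps every tied row.

```r
library(data.table)

# Hypothetical data: in AK, two companies are tied at the minimum of 24
ties <- data.table(
  State     = c("AK", "AK", "AK", "RI", "RI"),
  Company   = c("A", "B", "C", "D", "E"),
  Employees = c(24L, 24L, 37L, 19L, 42L)
)

# which.min() returns the position of the first minimum only,
# so each group contributes exactly one row
first_min <- ties[, .SD[which.min(Employees)], by = State]

# Comparing against min() keeps every row that ties for the minimum
all_min <- ties[, .SD[Employees == min(Employees)], by = State]

print(first_min)
print(all_min)
```

With this input, `first_min` has one row per state, while `all_min` has an extra row for the second AK company tied at 24.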

The base function `by` is often useful for working with block data in data.frames. For example:
by(data, data$State, function(x) x[which.min(x$Employees), ] )

It does return the data in a list, but you can collapse that with:
do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))
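
As a rough illustration of the two steps, the sketch below rebuilds the question's sample data with `data.frame()` (using character columns instead of the original factors, which is an assumption made for brevity) and stores the intermediate `by` object before collapsing it. The names `per_group` and `result` are introduced here only for illustration.

```r
# Sample data from the question, rebuilt with character columns
data <- data.frame(
  State     = c("AK", "AK", "AK", "AK", "RI", "RI", "RI", "RI"),
  Company   = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Employees = c(82L, 104L, 37L, 24L, 19L, 118L, 88L, 42L)
)

# Step 1: by() applies the function to each State block and returns
# a list-like object of class "by", one one-row data.frame per group
per_group <- by(data, data$State, function(x) x[which.min(x$Employees), ])

# Step 2: bind the per-group rows back into a single data.frame
result <- do.call(rbind, per_group)
print(result)
```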

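Here is a `dplyr` solution (note that I'm not a regular user); it is the same expression as the `slice(which.min)` entry in the benchmark below:
data %>% group_by(State) %>% slice(which.min(Employees))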
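Since this is Google's top hit, I thought I would add some additional options that I find useful to know. The idea is basically to order by `Employees` once and then just take the unique rows per `State`.

Using `data.table`: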

library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24


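Alternatively, we can first order and then subset `.SD`. Both operations have been optimized in recent `data.table` versions: `order` seemingly triggers `data.table:::forderv`, while `.SD[1L]` triggers `GForce`.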

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24


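Another interesting idea, borrowed from an awesome answer (with a small modification in the form of `mult = "first"` in order to handle multiple matches), is to first find the minimum per group and then perform a binary join back. The advantage is that this leverages both data.table's `gmin` function (which skips the evaluation overhead) and the binary join feature.

The join, a keyed variant of it, and a few `dplyr`, `plyr` and base R alternatives are all included in the benchmark below.
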
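
As a minimal sketch of that join on the question's sample data, the code below mirrors the `self join (on)` entry of the benchmark. The table is rebuilt with character columns rather than the original factors, which is an assumption made to keep the example short; `res` is a name introduced only for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  State     = c("AK", "AK", "AK", "AK", "RI", "RI", "RI", "RI"),
  Company   = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Employees = c(82L, 104L, 37L, 24L, 19L, 118L, 88L, 42L)
)

# Step 1: minimum number of employees per state
tmp <- DT[, .(Employees = min(Employees)), by = State]

# Step 2: join back on both columns; mult = "first" keeps a single row
# per group even if several rows share the minimum
res <- DT[tmp, on = .(State, Employees), mult = "first"]
print(res)
```
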
library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)

microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
               "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
               "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
               "(data.table) self join (on):" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = "first"]},
               "(data.table) self join (setkey):" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = "first"]},
               "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
               "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))


# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h

library(plyr)
ddply(data, .(State), function(x) x[which.min(x$Employees), ])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19

data[data$Employees == ave(data$Employees, data$State, FUN=min),]
#  State Company Employees
#4    AK       D        24
#5    RI       E        19

data[as.logical(ave(data$Employees, data$State, FUN=function(x) x==min(x))),]
#data[ave(data$Employees, data$State, FUN=function(x) x==min(x))==1,] #Variant
#  State Company Employees
#4    AK       D        24
#5    RI       E        19

library(collapse)
library(magrittr)
data %>% 
  fgroup_by(State) %>% 
  fsummarise(Employees = fmin(Employees))