Aggregate strings by a column in R, keeping only the first/last value


I have a dummy dataset like this:

  x  y
1 1  test1
2 2  test2
3 2  test3
4 3  test4
5 3  test5

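I want to aggregate it by the values in x, but rather than concatenating the strings or taking the most frequent one, I only want to show the last or first value of y for each value of x (based on row number). I would like to know how to get both the last value and the first value. Simply removing duplicates based on x does not give me the flexibility to choose which value of y is kept.

The output should hold one row per value of x, with either the last y (test1, test3, test5) or the first y (test1, test2, test4).

My real dataset is large, with more than 1M rows. I have tried the aggregate and ddply approaches. Any help would be appreciated.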

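You can use dplyr::distinct(), which keeps unique rows based on the given variables. If you set the .keep_all argument to TRUE, you get the first row for each distinct value of the specified variable.

To get the first: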

library(dplyr)
df %>% 
      distinct(x, .keep_all = TRUE)

#  x     y
#1 1 test1
#2 2 test2
#3 3 test4

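To get the last row, you can reverse the data frame by sorting it in descending order with row_number() and then apply distinct(), which then keeps what was originally the last row for each value of x.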
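A minimal sketch of the step described above, assuming df is the sample data frame from the question. The final arrange(x) is an optional addition that restores ascending order of x, and the name last_rows is only illustrative:

```r
library(dplyr)

# Sample data from the question (assumed to be stored in `df`)
df <- data.frame(x = c(1L, 2L, 2L, 3L, 3L),
                 y = c("test1", "test2", "test3", "test4", "test5"),
                 stringsAsFactors = FALSE)

last_rows <- df %>%
  arrange(desc(row_number())) %>%    # reverse the row order
  distinct(x, .keep_all = TRUE) %>%  # first row per x is now the original last row
  arrange(x)                         # optional: put x back in ascending order

last_rows
```

Dropping the final arrange(x) leaves the result in reversed order, which is harmless if the row order does not matter downstream.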

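You can use duplicated(). With fromLast=TRUE it keeps the last row for each value of x; without it, the first: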


df[!duplicated(df$x, fromLast=TRUE),]
  x     y
1 1 test1
3 2 test3
5 3 test5

df[!duplicated(df$x),]
  x     y
1 1 test1
2 2 test2
4 3 test4
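Since the question also asks for the first and last value together, the two duplicated() calls can be combined into one table. This is only a sketch, assuming df holds the sample data; the names first_rows, last_rows, both and the column suffixes are illustrative choices, not part of the original answer:

```r
# Sample data from the question (assumed to be stored in `df`)
df <- data.frame(x = c(1L, 2L, 2L, 3L, 3L),
                 y = c("test1", "test2", "test3", "test4", "test5"),
                 stringsAsFactors = FALSE)

first_rows <- df[!duplicated(df$x), ]                   # first row per x
last_rows  <- df[!duplicated(df$x, fromLast = TRUE), ]  # last row per x

# Join on x rather than binding columns, so the result stays correct
# even when the groups are interleaved in the original data
both <- merge(first_rows, last_rows, by = "x",
              suffixes = c("_first", "_last"))

both
```

merge() is used instead of cbind() because the two subsets are not guaranteed to list the groups in the same order when values of x are not contiguous.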

Alternatively, you can use data.table, since you said your data is very large. I give two methods for each of the first and last value; both return the same result. The method using setkey will be faster.

library(data.table)

First value

Method 1:

dt[dt[, list(keep = .I[which.min(.I)]), by = .(x)][, keep]]

Method 2:

setkey(dt, x)
dt[J(unique(x)), mult = "first"]

   x     y
1: 1 test1
2: 2 test2
3: 3 test4

Last value

Method 1:

dt[dt[, list(keep = .I[which.max(.I)]), by = .(x)][, keep]]

Method 2:

setkey(dt, x)
dt[J(unique(x)), mult = "last"]

   x     y
1: 1 test1
2: 2 test3
3: 3 test5

Data

dt <- data.table(x = c(1L, 2L, 2L, 3L, 3L),
                 y = factor(c("test1", "test2", "test3", "test4", "test5")))

Comment: How did your search for "select the first or last value in R" go?
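As a compact variation on the data.table answer, the first and last y can be pulled in a single grouped call, and unique() with fromLast = TRUE gives the last row per group. This is a sketch under the assumption that dt holds the sample data (y is stored as character here for simplicity); the names both, last_rows, y_first and y_last are illustrative, and no claim is made about speed relative to the setkey method:

```r
library(data.table)

# Sample data from the question (assumed to be stored in `dt`)
dt <- data.table(x = c(1L, 2L, 2L, 3L, 3L),
                 y = c("test1", "test2", "test3", "test4", "test5"))

# One row per x holding both the first and the last y;
# .N is the number of rows in the current group
both <- dt[, .(y_first = y[1L], y_last = y[.N]), by = x]

# Last row per x, keeping all columns
last_rows <- unique(dt, by = "x", fromLast = TRUE)

both
last_rows
```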