R: Create a sequence number (counter) for the rows within each group of a data frame

r dataframe

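How can we generate unique id numbers within each group of a data frame? The data are grouped by "personid".

I would like to add an id column whose value is unique for each row within each subset defined by "personid", always starting with 1. This is my desired output: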

personid date measurement id
1         x     23         1
1         x     32         2
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

Any help is much appreciated.

I think there is a canned command for this, but I can't remember it. So here is one way:

> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
 [1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
 [1] 1 1 2 2 3 4 5 6 7 8
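Note that cumsum(duplicated(x)) + 1 is a running count of repeated values over the whole vector, so it agrees with a per-group counter only in special cases. A minimal sketch with a fixed, made-up vector x, compared against ave(), which restarts the count within each distinct value:

```r
# Hypothetical input, fixed so the result is reproducible
x <- c("a", "b", "a", "b")

# Running count of duplicates over the whole vector
running <- cumsum(duplicated(x)) + 1
running
# [1] 1 1 2 3

# Counter that restarts within each distinct value
per_group <- ave(seq_along(x), x, FUN = seq_along)
per_group
# [1] 1 1 2 2
```

The two results differ from the third element on, which is why the grouped approaches in the other answers are the safer choice here.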

Assuming your data are in a data.frame named Data, this will do the trick:

# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() calculates the number of each personid
# sequence() creates a n-length vector for each element in the input,
# and concatenates the result
Data$id <- sequence(tabulate(Data$personid))
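To see how the pieces fit together, here is a small sketch on a made-up Data frame whose rows start out of order. tabulate() expects positive integer ids; the rle() variant at the end is one possible workaround for other id types and is an assumption, not part of the answer above.

```r
# Hypothetical data, deliberately out of order
Data <- data.frame(personid    = c(3, 1, 3, 2, 1, 3),
                   measurement = c(23, 23, 23, 21, 32, 23))

Data <- Data[order(Data$personid), ]          # rows of one person become adjacent
Data$id <- sequence(tabulate(Data$personid))  # counts 2, 1, 3 -> 1 2 1 1 2 3
Data$id
# [1] 1 2 1 1 2 3

# Variant for ids that are not positive integers: run lengths of the sorted ids
ids <- sort(c("b", "a", "b"))
sequence(rle(ids)$lengths)
# [1] 1 1 2
```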

The misleadingly named ave() function, with the argument FUN=seq_along, will accomplish this nicely, even if your personid column is not strictly ordered:

df <- read.table(text = "personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23", header=TRUE)

## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3

## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
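One caveat worth sketching: ave() writes its results back into a copy of its first argument, so the counter inherits that argument's type. With character ids (the p1, p2, p3 values below are made up), passing the row indices as the first argument keeps the counter an integer:

```r
personid <- c("p1", "p1", "p2", "p3", "p3", "p3")  # hypothetical character ids

# The result takes the type of the first argument, so this is character
id_chr <- ave(personid, personid, FUN = seq_along)
class(id_chr)
# [1] "character"

# Passing row indices as the first argument keeps the counter integer
id_int <- ave(seq_along(personid), personid, FUN = seq_along)
id_int
# [1] 1 2 1 1 2 3
```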

Using data.table, and assuming you wish to order by date within each personid subset:

library(data.table)
DT <- data.table(Data)

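# rank() gives each row's position in date order; order() would return the
# sorting permutation, which equals the rank only by coincidence for this data
DT[, id := rank(date, ties.method = "first"), by = personid]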

##    personid date measurement id
## 1:        1    x          23  1
## 2:        1    x          32  2
## 3:        2    y          21  1
## 4:        3    x          23  1
## 5:        3    z          23  3
## 6:        3    y          23  2

Either of the following will also work, keeping the original row order:
DT[, id := seq_along(measurement), by =  personid]
DT[, id := seq_along(date), by =  personid]
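When numbering rows by date within a group, it helps to keep order() and rank() apart: order() returns the permutation that sorts a vector, while rank() returns each element's position in the sorted result. A base R sketch with made-up dates (the df below is hypothetical and not the data from the question):

```r
dates <- c("z", "x", "y")

order(dates)                        # indices that would sort the vector
# [1] 2 3 1
rank(dates, ties.method = "first")  # position of each element once sorted
# [1] 3 1 2

# Per-group position in date order, using ave() over row indices
df <- data.frame(personid = c(1, 1, 2, 3, 3, 3),
                 date     = c("x", "x", "y", "z", "x", "y"))
df$id <- ave(seq_along(df$date), df$personid,
             FUN = function(i) rank(df$date[i], ties.method = "first"))
df$id
# [1] 1 2 1 3 1 2
```

The two functions return the same vector only when the sorting permutation happens to be its own inverse, as with the dates x, z, y in the question's third group.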

Using plyr:

library(plyr)
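# ordering by date (rank() gives each row's position in date order;
# order() would return the sorting permutation instead)
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))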
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))

You can use sqldf:

df <- read.table(header = TRUE, text = "personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23")

library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
       FROM df a, df b 
       WHERE a.personid = b.personid AND b.ROWID <= a.ROWID 
       GROUP BY a.ROWID"
)

#  personid date measurement count
#1        1    x          23     1
#2        1    x          32     2
#3        2    y          21     1
#4        3    x          23     1
#5        3    z          23     2
#6        3    y          23     3
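The query works by joining the table to itself and, for each row, counting the rows with the same personid that appear at or before it. As a rough base R sketch of the same counting logic (an illustration only, and quadratic in the number of rows, so suited to small data):

```r
df <- data.frame(personid = c(1, 1, 2, 3, 3, 3))

# For row i, count the rows up to and including i with the same personid
# (the same condition as b.ROWID <= a.ROWID in the query)
count <- vapply(seq_len(nrow(df)), function(i) {
  sum(df$personid[seq_len(i)] == df$personid[i])
}, integer(1))
count
# [1] 1 2 1 1 2 3
```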

Some dplyr alternatives, using the convenience functions row_number and n:

library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
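The four calls above all number rows in their existing order. If the counter should follow date order instead, row_number() also accepts a vector to rank by. A sketch on made-up data (the column names id_row and id_date are chosen here for illustration):

```r
library(dplyr)

# Hypothetical data; the third group is not in date order
df <- data.frame(
  personid    = c(1, 1, 2, 3, 3, 3),
  date        = c("x", "x", "y", "z", "x", "y"),
  measurement = c(23, 32, 21, 23, 23, 23)
)

out <- df %>%
  group_by(personid) %>%
  mutate(id_row  = row_number(),          # position in the existing row order
         id_date = row_number(date)) %>%  # position in date order, ties by first appearance
  ungroup()

out$id_row
# [1] 1 2 1 1 2 3
out$id_date
# [1] 1 2 1 3 1 2
```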


You can also use getanID from the package splitstackshape. Note that the input dataset is returned as a data.table:

library(splitstackshape)
getanID(data = df, id.vars = "personid")
#    personid date measurement .id
# 1:        1    x          23   1
# 2:        1    x          32   2
# 3:        2    y          21   1
# 4:        3    x          23   1
# 5:        3    z          23   2
# 6:        3    y          23   3

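The dplyr solution is nice. However, if, like me, you keep getting strange errors when trying this approach, make sure you are not hitting a conflict between plyr and dplyr. It can be avoided by calling dplyr::mutate(...) explicitly.

@EcologyTom You're right, I'm surprised I hadn't run into this kind of problem before. I didn't realize plyr was loaded; it had probably been loaded a few days earlier. Thanks for your answer!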