R: ordering and evaluating duplicates in a vector
I am trying to create a variable that identifies whether a string in a vector is its first occurrence, falls within its first three occurrences, or comes after the third. For example, in the dataset below I have name (there will be more names), text, and a dup variable. I want dup to identify whether the text is appearing for the first time (origin), is within its first three occurrences (FirstThree), or has appeared more than three times (MoreThanThree). I also need to do this for each person, but I think I can figure that part out. Thanks in advance for your help.
name =c("T","T","T","T","T","T","T","T","T","T")
text =c("a","b","a","a","b","c","a","a","b","a")
dup =c("origin","origin","FirstThree","FirstThree","FirstThree","origin","MoreThanThree","MoreThanThree","FirstThree","MoreThanThree")
dfA = data.frame(name,text,dup)
name text dup
1 T a origin
2 T b origin
3 T a FirstThree
4 T a FirstThree
5 T b FirstThree
6 T c origin
7 T a MoreThanThree
8 T a MoreThanThree
9 T b FirstThree
10 T a MoreThanThree
You can use data.table::rowid together with two ifelse checks:
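library(data.table)
setDT(dfA)  # convert the data.frame to a data.table by reference
dfA[, ict := {
  r <- rowid(text)  # running count of each text value
  ifelse(r == 1, 'origin',
         ifelse(r <= 3, 'FirstThree',
                'MoreThanThree'))}
  , by = name]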
dfA
# name text dup ict
# 1: T a origin origin
# 2: T b origin origin
# 3: T a FirstThree FirstThree
# 4: T a FirstThree FirstThree
# 5: T b FirstThree FirstThree
# 6: T c origin origin
# 7: T a MoreThanThree MoreThanThree
# 8: T a MoreThanThree MoreThanThree
# 9: T b FirstThree FirstThree
# 10: T a MoreThanThree MoreThanThree
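Since the real data will contain more than one name, a small sketch with a second name may help show that the count restarts for each person. The table `dt` and the name "U" are made up for illustration and are not part of the original data.

```r
library(data.table)

# Hypothetical toy data: the second name "U" is an assumption for illustration
dt <- data.table(
  name = c("T", "T", "T", "T", "T", "U", "U"),
  text = c("a", "a", "a", "a", "b", "a", "a")
)

# rowid(text) numbers the occurrences of each text value;
# by = name restarts that numbering for every person
dt[, ict := {
  r <- rowid(text)
  ifelse(r == 1, "origin",
         ifelse(r <= 3, "FirstThree", "MoreThanThree"))
}, by = name]

dt$ict
```

Here the fourth "a" for T is labelled MoreThanThree, while the first "a" for U starts again at origin.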
In dplyr, we can compare row_number() inside a case_when statement:
library(dplyr)
dfA %>%
  group_by(name, text) %>%   # name is included so the count is kept per person
  mutate(row = row_number(), # occurrence number within each group
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE ~ "MoreThanThree"))
#   name  text    row dup
#   <fct> <fct> <int> <chr>
# 1 T     a         1 origin
# 2 T     b         1 origin
# 3 T     a         2 FirstThree
# 4 T     a         3 FirstThree
# 5 T     b         2 FirstThree
# 6 T     c         1 origin
# 7 T     a         4 MoreThanThree
# 8 T     a         5 MoreThanThree
# 9 T     b         3 FirstThree
#10 T     a         6 MoreThanThree
Nice! I didn't know about rowid; I would still have used seq_len(.N) with by = .(name, text) for this, or in base R: with(dfA, cut(ave(seq_along(text), text, name, FUN = seq_along), c(0, 1, 3, Inf), labels = c('origin', 'FirstThree', 'MoreThanThree')))
I didn't know that use of ave. I arrived at the same solution with data.table: dt[, dup_cut := cut(x = 1:.N, breaks = c(0, 1, 3, Inf), include.lowest = TRUE, labels = c("origin", "FirstThree", "MoreThanThree")), by = .(name, text)]
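For readers without data.table or dplyr, the ave and cut idea from the comments can be sketched in base R alone. The data frame `df`, the second name "U", and the helper vector `occ` are assumptions made for this illustration.

```r
# Hypothetical toy data with two names (not from the original question)
df <- data.frame(
  name = c("T", "T", "T", "T", "U", "U"),
  text = c("a", "a", "a", "a", "a", "b"),
  stringsAsFactors = FALSE
)

# Occurrence counter within each (name, text) pair
occ <- ave(seq_along(df$text), df$name, df$text, FUN = seq_along)

# Intervals: (0,1] -> origin, (1,3] -> FirstThree, (3,Inf] -> MoreThanThree
df$dup <- cut(occ, breaks = c(0, 1, 3, Inf),
              labels = c("origin", "FirstThree", "MoreThanThree"))
df
```

Note that cut returns a factor, so wrap it in as.character() if a plain character column is preferred.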