R function for duplicated rows
I have a data frame that looks like this:
> df
pat_id disease
[1,] "pat1" "dis1"
[2,] "pat1" "dis1"
[3,] "pat2" "dis0"
[4,] "pat2" "dis5"
[5,] "pat3" "dis2"
[6,] "pat3" "dis2"
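How can I write a function that produces a third variable indicating whether the disease values for the same pat_id are identical, as shown below?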
> df
pat_id disease var3
[1,] "pat1" "dis1" "1"
[2,] "pat1" "dis1" "1"
[3,] "pat2" "dis0" "0"
[4,] "pat2" "dis5" "0"
[5,] "pat3" "dis2" "1"
[6,] "pat3" "dis2" "1"
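Try using ave() for the grouping, and wrap the result of any(duplicated()) in as.integer(). Then bind it on with cbind(). That said, I would probably recommend using a data frame rather than a matrix here.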
cbind(
  df,
  var3 = ave(df[,2], df[,1], FUN = function(x) as.integer(any(duplicated(x))))
)
# pat_id disease var3
# [1,] "pat1" "dis1" "1"
# [2,] "pat1" "dis1" "1"
# [3,] "pat2" "dis0" "0"
# [4,] "pat2" "dis5" "0"
# [5,] "pat3" "dis2" "1"
# [6,] "pat3" "dis2" "1"
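Because the input above is a character matrix, the new column also comes back as character ("1" and "0"). A minimal sketch of the same idea on a data frame, where the flag can be stored as an integer. The data frame below is rebuilt by hand from the question's values and is an assumption, not the asker's actual object:

```r
# Assumed example data, stored as a data frame rather than a matrix
df <- data.frame(
  pat_id  = c("pat1", "pat1", "pat2", "pat2", "pat3", "pat3"),
  disease = c("dis1", "dis1", "dis0", "dis5", "dis2", "dis2"),
  stringsAsFactors = FALSE
)

# ave() returns a vector of the same type as its first argument
# (character here), so the per-group flag is converted afterwards
df$var3 <- as.integer(
  ave(df$disease, df$pat_id,
      FUN = function(x) as.integer(any(duplicated(x))))
)

sapply(df, class)
```

With a data frame, each column keeps its own type, so var3 can be compared or summed directly.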
For larger data, I would suggest converting to a data.table. The syntax is arguably a bit nicer as well, and it will likely be faster.
library(data.table)
dt <- as.data.table(df)
dt[, var3 := if(any(duplicated(disease))) 1L else 0L, by = pat_id]
Here the column classes are more appropriate (char, char, int). Alternatively, you could use as.integer(any(duplicated(disease))) instead of if/else.
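This is slightly more verbose, but it gives a boolean third variable that is easier to test. It also does not care about the data types.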
> df <- data.frame(pat_id=c("pat1","pat1", "pat2", "pat2", "pat3", "pat3"),
+ disease=c("dis1","dis1","dis0","dis5","dis2","dis2"),
+ stringsAsFactors = F)
> counts<-apply(table(df), 1, function(x) sum(x!=0))
> df2<-data.frame(pat_id=names(counts), all_the_same=(counts==1))
> df3<-merge(df,df2)
> df3
pat_id disease all_the_same
1 pat1 dis1 TRUE
2 pat1 dis1 TRUE
3 pat2 dis0 FALSE
4 pat2 dis5 FALSE
5 pat3 dis2 TRUE
6 pat3 dis2 TRUE
> sapply(df3, class)
pat_id disease all_the_same
"character" "character" "logical"
An option using dplyr:
library(dplyr)
as.data.frame(df) %>%
group_by(pat_id) %>%
mutate(var3 = as.integer(n_distinct(disease)==1))
# pat_id disease var3
# (chr) (chr) (int)
#1 pat1 dis1 1
#2 pat1 dis1 1
#3 pat2 dis0 0
#4 pat2 dis5 0
#5 pat3 dis2 1
#6 pat3 dis2 1
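as.integer(duplicated(dat) | duplicated(dat, fromLast=TRUE)) might work, but you do not have a data frame. This is a matrix. Also: as.integer(ave(df[,"disease"], df[,"pat_id"], FUN=anyDuplicated) > 0)
As a variation on the theme, I also have pat_ids with three rows. How do I modify the syntax so that it shows whether the disease is the same across 3 or more rows with the same pat_id?
@trillian - this will work for any number of rows with the same pat_id.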
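Which test is used starts to matter once a patient has three or more rows: any(duplicated(x)) is TRUE as soon as one value repeats, while length(unique(x)) == 1 (the base R counterpart of n_distinct(x) == 1) is TRUE only when every value in the group is identical. A small sketch with made-up three-row data; the values and the column names any_dup and all_same are hypothetical:

```r
# Hypothetical data: pat1 has three identical rows, pat2 has two equal and one different
df <- data.frame(
  pat_id  = c("pat1", "pat1", "pat1", "pat2", "pat2", "pat2"),
  disease = c("dis1", "dis1", "dis1", "dis1", "dis1", "dis2"),
  stringsAsFactors = FALSE
)

# Per-patient flags computed two ways
any_dup  <- tapply(df$disease, df$pat_id, function(x) any(duplicated(x)))
all_same <- tapply(df$disease, df$pat_id, function(x) length(unique(x)) == 1)

# Look up each row's pat_id in the per-patient results
df$any_dup  <- as.integer(any_dup[df$pat_id])
df$all_same <- as.integer(all_same[df$pat_id])
df
```

For pat2 the two flags disagree: a value repeats, but the three rows are not all the same. With exactly two rows per patient, as in the question, both tests give the same answer.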
> unique(df3$pat_id[df3$all_the_same])
[1] "pat1" "pat3"