R: counting occurrences per category in a column
Tags: r, count, data.table

I am trying to count the number of "Opp" occurrences in the iets column for each SNP name (eventually I want to divide the "Opp" count by df$MM). I have been trying something along the lines below, but I cannot figure out how to assign the counts to my oppcount/percentage columns.
First, I have to count the number of "Opp" per SNP and then divide by MM:
as.character((sum(df$iets == "Opp")/(df[,.N, by = df$SNP][[2]])))
#[1] "0.5" "2"
How do I count the "Opp" occurrences per SNP (category)?
How about using dplyr:
library('dplyr')
df %>% group_by(iets, SNP) %>% summarize(count=sum(count)) %>% filter(iets=='Opp')
You can update the data.table by reference using the := operator. With:
df[, `:=` (oppcount = sum(iets=='Opp'), percentage = sum(iets=='Opp')/.N), by = SNP]
you get:
> df
SNP FID IID NEW OLD count MM iets oppcount percentage
1: rs80932150 116601888 116601888 T/T C/C 1 4 Opp 2 0.5
2: rs80932150 116621563 116621563 T/C C/C 1 4 Het 2 0.5
3: rs80932150 117253533 117253533 T/T C/C 1 4 Opp 2 0.5
4: rs000001 118635095 118635095 T/C C/C 1 1 Het 0 0.0
5: rs80932150 118943247 118943247 T/C C/C 1 4 Het 2 0.5
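For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample table from the printed output and applies the same := call. The values are copied from the output above; the column types (numeric IDs, integer count and MM) are assumptions, since the question does not show them.

```r
library(data.table)

# Sample data copied from the printed output above; column types are assumed.
df <- data.table(
  SNP   = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  FID   = c(116601888, 116621563, 117253533, 118635095, 118943247),
  IID   = c(116601888, 116621563, 117253533, 118635095, 118943247),
  NEW   = c("T/T", "T/C", "T/T", "T/C", "T/C"),
  OLD   = "C/C",
  count = 1L,
  MM    = c(4L, 4L, 4L, 1L, 4L),
  iets  = c("Opp", "Het", "Opp", "Het", "Het")
)

# Add both columns by reference; each group's value is repeated on all its rows.
df[, `:=`(oppcount = sum(iets == "Opp"),
          percentage = sum(iets == "Opp") / .N), by = SNP]

print(df)
```

In this sample MM happens to equal the number of rows per SNP, so dividing by .N and dividing by MM give the same result; if that does not hold in the real data, the denominator would need to be MM instead.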
Alternatively, as suggested by @Frank in the comments, you can also use one of the following two options:
# method 1
df[, c('oppcount', 'percentage') := {s = sum(iets=='Opp'); .(s, s/.N)}, by = SNP]
# method 2
df[df[, {s = sum(iets=='Opp'); .(oppcount = s, percentage = s/.N)}, by = SNP], on = 'SNP']
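One difference between the two methods is worth spelling out: method 1 modifies df in place, while method 2 returns a new joined table whose rows follow the per-SNP summary, so rows of the same SNP end up next to each other. The sketch below illustrates this on a cut-down version of the sample data and shows one way to keep the original row order with match(); the names per_snp, joined, idx and in_order are made up for the illustration.

```r
library(data.table)

# Cut-down sample: only the two columns needed for the calculation.
df <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# Per-SNP summary: one row per group, in order of first appearance.
per_snp <- df[, {s = sum(iets == "Opp"); .(oppcount = s, percentage = s / .N)},
              by = SNP]

# Method 2 style join: a new table whose rows follow per_snp, not df.
joined <- df[per_snp, on = "SNP"]

# Keeping the original row order: look up each row's group with match().
idx <- match(df$SNP, per_snp$SNP)
in_order <- cbind(df, per_snp[idx, .(oppcount, percentage)])

print(joined)
print(in_order)
```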
A base R alternative:
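# ave() writes the group results back into its first argument, so passing the
# character column iets would yield character columns; use a numeric indicator
transform(df,
          oppcount = ave(as.numeric(iets == 'Opp'), SNP, FUN = sum),
          percentage = ave(as.numeric(iets == 'Opp'), SNP, FUN = mean))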
A correct dplyr alternative is:
library(dplyr)
df %>%
group_by(SNP) %>%
mutate(oppcount = sum(iets=='Opp'),
percentage = oppcount/n())
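Comments:

This is a first step, but my actual dataset contains about 100 thousand rows with about 200 thousand different SNPs. I am looking for a solution that works regardless of the SNP names.

I hope the reordering of each SNP group does not matter.

Fortunately it does not. Reordering the SNP groups is not a problem, since I can simply match() them back to where they belong!

FYI, the standard way to count the instances where a condition holds is sum(cond(x)) rather than length(x[cond(x)]), as shown in the accepted answer. I think dplyr also has some other convenience functions for this.

In this example, it only produces one row and does not include the rs ids that only have "Het".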
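To illustrate the sum(cond(x)) remark, here is a small base R sketch. The vector is made up for the example, and the NA in it is an assumption added to show where the two idioms diverge; without missing values both return the same count.

```r
# Hypothetical values, including one missing entry.
iets <- c("Opp", "Het", "Opp", NA, "Het")

# Summing the logical vector counts the matches; na.rm = TRUE skips the NA.
n_sum <- sum(iets == "Opp", na.rm = TRUE)

# Subsetting with a logical vector that contains NA keeps an NA element,
# so the length is one larger than the number of real matches.
n_len <- length(iets[iets == "Opp"])

print(c(sum = n_sum, length = n_len))
```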