R: count occurrences per category in a column


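I am trying to count how many times "Opp" occurs in the iets column for each SNP name (eventually I want to divide that "Opp" count by df$MM).

I have been trying something like the line below, but I can't figure out how to assign the resulting counts to my oppcount and percentage columns.
First I need to count the number of "Opp" per SNP, then divide by MM.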

as.character((sum(df$iets == "Opp")/(df[,.N, by = df$SNP][[2]])))
#[1] "0.5" "2"  

How can I count the "Opp" occurrences for each SNP (category)?
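A note on why the attempt above returns "0.5" and "2": sum(df$iets == "Opp") counts "Opp" over the whole table, and that single total is then divided by each SNP's row count. The sketch below reproduces this with sample data rebuilt from the table printed further down in this thread. Only the columns needed here are included, and the names total_opp, group_size and per_snp are made up for the illustration.

```r
library(data.table)

# Sample data rebuilt from the printed table in this thread (illustration only)
df <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  MM   = c(4L, 4L, 4L, 1L, 4L),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

total_opp  <- sum(df$iets == "Opp")     # one number for the whole table
group_size <- df[, .N, by = SNP][[2]]   # one row count per SNP
ratio      <- total_opp / group_size    # the total recycled over the groups

# Counting inside each group instead gives one count per SNP
per_snp <- df[, .(oppcount = sum(iets == "Opp")), by = SNP]
print(per_snp)
```

The key difference is that the comparison is evaluated inside by = SNP, so each group gets its own sum.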

How about using dplyr?

library('dplyr')
df %>% group_by(iets, SNP) %>% summarize(count=sum(count)) %>% filter(iets=='Opp')

You can update the data.table by reference with the := operator:

df[, `:=` (oppcount = sum(iets=='Opp'), percentage = sum(iets=='Opp')/.N), by = SNP]
which gives:

> df
          SNP       FID       IID NEW OLD count MM iets oppcount percentage
1: rs80932150 116601888 116601888 T/T C/C     1  4  Opp        2        0.5
2: rs80932150 116621563 116621563 T/C C/C     1  4  Het        2        0.5
3: rs80932150 117253533 117253533 T/T C/C     1  4  Opp        2        0.5
4:   rs000001 118635095 118635095 T/C C/C     1  1  Het        0        0.0
5: rs80932150 118943247 118943247 T/C C/C     1  4  Het        2        0.5
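For anyone who wants to run this, here is a minimal self-contained sketch. The sample data is rebuilt from the printed output above (IID is left out for brevity). The pct_of_MM column is an assumption about what "divide by MM" in the question means; in this sample MM happens to equal the number of rows per SNP, so both ratios agree.

```r
library(data.table)

# Sample data rebuilt from the printed output above (illustration only)
df <- data.table(
  SNP   = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  FID   = c(116601888, 116621563, 117253533, 118635095, 118943247),
  NEW   = c("T/T", "T/C", "T/T", "T/C", "T/C"),
  OLD   = "C/C",
  count = 1L,
  MM    = c(4L, 4L, 4L, 1L, 4L),
  iets  = c("Opp", "Het", "Opp", "Het", "Het")
)

# Add both columns by reference, computed separately within each SNP
df[, `:=`(oppcount   = sum(iets == "Opp"),
          percentage = sum(iets == "Opp") / .N), by = SNP]

# Assumption: MM is the denominator the question has in mind
df[, pct_of_MM := oppcount / MM]

print(df)
```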
Alternatively, as @Frank suggested in the comments, you can use either of these two options:

# method 1
df[, c('oppcount', 'percentage') := {s = sum(iets=='Opp'); .(s, s/.N)}, by = SNP]
# method 2
df[df[, {s = sum(iets=='Opp'); .(oppcount = s, percentage = s/.N)}, by = SNP], on = 'SNP']
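One difference between the two methods is worth spelling out: method 1 modifies df in place, while method 2 returns a new data.table whose rows follow the order of the aggregated table. The sketch below shows this on a cut-down version of the sample data, and adds an update join as a possible third variant that keeps the original row order. The names agg and joined are made up for the illustration.

```r
library(data.table)

# Cut-down sample data (illustration only)
df <- data.table(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het")
)

# One row per SNP
agg <- df[, {s = sum(iets == "Opp"); .(oppcount = s, percentage = s / .N)}, by = SNP]

# Method 2: a join that returns a NEW table; df itself is not changed,
# and the rows come out grouped in the order of agg
joined <- df[agg, on = "SNP"]
print(joined)

# Update join: copies the aggregated columns into df by reference,
# leaving the rows of df where they were
df[agg, on = "SNP", `:=`(oppcount = i.oppcount, percentage = i.percentage)]
print(df)
```

If the result of method 2 is needed later, it has to be assigned to a name, as done with joined here.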

A base R alternative:

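transform(df,
          # compare first so ave() works on a logical vector and returns numbers;
          # passing the character column iets itself would give character results
          oppcount = ave(iets == 'Opp', SNP, FUN = sum),
          percentage = ave(iets == 'Opp', SNP, FUN = mean))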

The proper dplyr alternative is:

library(dplyr)
df %>% 
  group_by(SNP) %>% 
  mutate(oppcount = sum(iets=='Opp'),
         percentage = oppcount/n())
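If one row per SNP is wanted rather than a column repeated on every row, swapping mutate() for summarise() is one way to get it. This is a sketch on cut-down sample data, not part of the original answer. Because the comparison happens inside the summary, SNPs without any "Opp" still appear, with a count of 0.

```r
library(dplyr)

# Cut-down sample data (illustration only)
df <- data.frame(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het"),
  stringsAsFactors = FALSE
)

per_snp <- df %>%
  group_by(SNP) %>%
  summarise(oppcount   = sum(iets == "Opp"),    # number of "Opp" rows in the group
            percentage = mean(iets == "Opp"))   # share of "Opp" rows in the group

print(per_snp)
```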

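That's a first step, but my actual dataset contains about 100,000 rows with about 200,000 distinct SNPs. I'm looking for a solution that works regardless of the SNP names.

I hope the reordering by SNP group doesn't matter.

Luckily it doesn't. Reordering the SNP groups is not a problem, because I can simply match() them back to where they belong!

FYI, the standard way to count the cases where a condition holds is sum(cond(x)) rather than length(x[cond(x)]), as the accepted answer shows.

I think dplyr has some other convenient functions for this. In this example it produces only one row and leaves out the rs id that only has "Het".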
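The two points from the comments, counting with sum(cond(x)) and putting per-group results back in place with match(), can be sketched in base R as follows. The data is a cut-down version of the sample table, and the names n_sum, n_length and opp_by_snp are made up for the illustration.

```r
# Cut-down sample data (illustration only)
df <- data.frame(
  SNP  = c("rs80932150", "rs80932150", "rs80932150", "rs000001", "rs80932150"),
  iets = c("Opp", "Het", "Opp", "Het", "Het"),
  stringsAsFactors = FALSE
)

# Two ways to count the rows where the condition holds; with no missing
# values they agree, but sum() avoids building the intermediate subset
n_sum    <- sum(df$iets == "Opp")
n_length <- length(df$iets[df$iets == "Opp"])

# One count per SNP; tapply() orders its result by SNP name, not by row order
opp_by_snp <- tapply(df$iets == "Opp", df$SNP, sum)

# match() looks up each row's SNP in the per-group result, so the counts
# land on the right rows whatever order the groups came back in
df$oppcount <- as.vector(opp_by_snp[match(df$SNP, names(opp_by_snp))])

print(df)
```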