Calculating the incidence of a disease in R (R, dplyr, tidyverse)


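I am trying to calculate the incidence of the first occurrence of a disease (for example myocardial infarction (MI), "heart attack"), but I am struggling to implement this in R (base or tidyverse). Thanks for your help.

Thank you all. This works. I realize my example was unclear. These approaches work well overall, but I would like to find a way to compute incidence and prevalence by time period. Incidence is the number of new cases occurring at a particular time divided by the number of people who had not yet had the disease.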


n_id <- 5 # five individuals
n_time <- 4 # four time points
id <- rep(1:n_id, each = n_time)
time <- rep(1:n_time,times = n_id)
MI <- c(0,0,1,1,
        0,1,1,1,
        0,0,0,1,
        0,0,0,0,
        0,0,0,0)
dsn <- data.frame(id, time, MI)
MI2 <- c(0,0,1,NA,
         0,1,NA,NA,
         0,0,0,1,
         0,0,0,0,
         0,0,0,0)
dsn2 <- data.frame(id, time, MI, MI2)
library(dplyr)
# Assign the sorted result; arrange() does not modify dsn2 in place
dsn2 <- arrange(dsn2, time)
dsn2

#>    id time MI MI2
#> 1   1    1  0   0
#> 2   2    1  0   0
#> 3   3    1  0   0
#> 4   4    1  0   0
#> 5   5    1  0   0
#> 6   1    2  0   0
#> 7   2    2  1   1
#> 8   3    2  0   0
#> 9   4    2  0   0
#> 10  5    2  0   0
#> 11  1    3  1   1
#> 12  2    3  1  NA
#> 13  3    3  0   0
#> 14  4    3  0   0
#> 15  5    3  0   0
#> 16  1    4  1  NA
#> 17  2    4  1  NA
#> 18  3    4  1   1
#> 19  4    4  0   0
#> 20  5    4  0   0

#in the example above, it can be calculated as below
#For the incidence at each time point (proportion of new cases that occur at a particular time divided by the number of people who did not get the disease)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2
#time 3 = 1/4 =0.25
#time 4 = 1/3 =0.33

##For the prevalence at each time point (the proportion of new and old cases divided by total population)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2
#time 3 = 2/5 =0.4
#time 4 = 3/5 =0.6

time <- 1:4
incidence <- c(0/5, 1/5, 1/4, 1/3)
prevalence <- c(0/5, 1/5, 2/5, 3/5)

results <- cbind(time, incidence, prevalence)
results
#>      time incidence prevalence
#> [1,]    1 0.0000000        0.0
#> [2,]    2 0.2000000        0.2
#> [3,]    3 0.2500000        0.4
#> [4,]    4 0.3333333        0.6



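I would like to be able to do this for every time point, taking into account what happened at the previous time point. Would a for loop be one way to do it?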
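As a rough answer to the for-loop question, here is a minimal base R sketch. It assumes the long-format `dsn` data from the question, that everyone is disease-free before the first time point, and that `MI` stays 1 once it has occurred. The names `loop_results`, `at_risk`, `cases` and `new_cases` are made up for this illustration.

```r
# Rebuild the example data from the question (one row per id/time).
n_id <- 5
n_time <- 4
dsn <- data.frame(
  id   = rep(1:n_id, each = n_time),
  time = rep(1:n_time, times = n_id),
  MI   = c(0,0,1,1,  0,1,1,1,  0,0,0,1,  0,0,0,0,  0,0,0,0)
)

times <- sort(unique(dsn$time))
loop_results <- data.frame(time = times,
                           incidence = NA_real_,
                           prevalence = NA_real_)

# Assumption: nobody has the disease before the first time point.
at_risk <- unique(dsn$id)

for (i in seq_along(times)) {
  current <- dsn[dsn$time == times[i], ]
  cases <- current$id[current$MI == 1]       # everyone with MI at this time
  new_cases <- intersect(cases, at_risk)     # cases among those still at risk
  loop_results$incidence[i]  <- length(new_cases) / length(at_risk)
  loop_results$prevalence[i] <- length(cases) / nrow(current)
  at_risk <- setdiff(at_risk, new_cases)     # new cases leave the risk set
}

loop_results
```

With the example data this should reproduce the hand-calculated values (0, 0.2, 0.25, 0.33 for incidence and 0, 0.2, 0.4, 0.6 for prevalence). If everyone has already had the disease, `at_risk` is empty and the incidence becomes `NaN`, which may or may not be what you want.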

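Thank you very much for the edit. Here is a solution that computes the incidence. It also returns the correct result if the disease occurs at time 1.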

library(dplyr) 

dsn %>%
  group_by(id) %>%
  mutate(neg = MI == 1 & !duplicated(MI)) %>%
  group_by(time) %>%
  summarise(d = sum(MI != 1),
            prevalence = mean(MI),
            n = sum(neg)) %>%
  transmute(time, 
            incidence = n / lag(d, default = n_distinct(dsn$id)),
            prevalence)

   time incidence prevalence
  <int>     <dbl>      <dbl>
1     1     0            0  
2     2     0.2          0.2
3     3     0.25         0.4
4     4     0.333        0.6
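To check the time-1 behaviour, the same pipeline can be run on a variant of the data in which one person already has the disease at the first time point. This is only a sketch: `dsn_t1` is a hypothetical data set (id 5 has MI from time 1 onward), and the intermediate columns are renamed to `first_case`, `disease_free` and `new_cases` for readability.

```r
library(dplyr)

# Hypothetical variant of the example: id 5 has MI at every time point.
dsn_t1 <- data.frame(
  id   = rep(1:5, each = 4),
  time = rep(1:4, times = 5),
  MI   = c(0,0,1,1,  0,1,1,1,  0,0,0,1,  0,0,0,0,  1,1,1,1)
)

res_t1 <- dsn_t1 %>%
  group_by(id) %>%
  # TRUE only on the first row where MI == 1 within each id
  mutate(first_case = MI == 1 & !duplicated(MI)) %>%
  group_by(time) %>%
  summarise(disease_free = sum(MI != 1),
            prevalence = mean(MI),
            new_cases = sum(first_case)) %>%
  # Denominator: people disease-free at the previous time point;
  # at the first time point, the whole cohort.
  transmute(time,
            incidence = new_cases / lag(disease_free,
                                        default = n_distinct(dsn_t1$id)),
            prevalence)

res_t1
```

Here the case at time 1 is counted against the full cohort of five, so the first incidence is 1/5 rather than 0.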


You can keep the last row for each id and then calculate the proportion.

library(dplyr)

dsn2 <- dsn %>%
  group_by(id) %>%
  slice(n())

sum(dsn2$MI)/nrow(dsn2)
# [1] 0.6


The edited question is harder to solve than the previous one. However, here is a solution using the tidyverse.

library(tidyverse)

dsn2 %>%
  #Group by time
  group_by(time) %>%
  #Get the sum of positives and negatives, as well as total ID number
  summarize(pos = sum(MI ==1),
            neg = sum(MI ==0),
            totalID = n_distinct(id)) %>%
  #add lagged entry of positives
  mutate(poslag = lag(pos)) %>%
  #Replace NA (first row) with zero
  replace_na (list(poslag = 0)) %>%
  #Get the number of new cases using pos and poslag
  mutate(news = pos - poslag) %>%
  #Get incidence and prevalence
  mutate(incidence = news/neg,
         prevalence = pos/totalID) %>%
  #Stay only with the time, incidence and prevalence columns
  select(time, incidence, prevalence)

# A tibble: 4 x 3
#   time incidence prevalence
#  <int>     <dbl>      <dbl>
#1     1     0            0  
#2     2     0.25         0.2
#3     3     0.333        0.4
#4     4     0.5          0.6


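The resulting incidence values differ from the ones you reported. However, I think yours were calculated incorrectly: at time 2 there is 1 new positive and 4 negatives, so the incidence should be 1/4 = 0.25, and the same applies to the following time points.

Thank you @Jonathan. Actually, for incidence you have to divide the new cases by the number of people who did not have the disease at the previous time point. So the denominators are: time 1 (5-0 = 5 people without the disease), time 2 (5-0 = 5 people without the disease), time 3 (5-1 = 4 people without the disease) and time 4 (4-1 = 3 people without the disease). I modified the code and got the correct results. I am not sure this is the most efficient way, but here it is.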

library(tidyverse)
dsn2 %>%
  #Group by time
  group_by(time) %>%
  #Get the sum of positives and negatives, as well as total ID number
  summarise(pos = sum(MI ==1),
            neg = sum(MI ==0),
            totalID = n_distinct(id)) %>%
  #add lagged entry of positives
  mutate(poslag = lag(pos), 
         neglag = lag(neg)) %>%
  #Replace NA with zero in poslag and 1 in neglag (because of the division)
  mutate(poslag = case_when(is.na(poslag) ~ 0, TRUE ~ as.double(poslag)),
         neglag = case_when(is.na(neglag) ~ 1, TRUE ~ as.double(neglag))) %>%
  #Get the number of new cases using pos and poslag
  mutate(news = pos - poslag) %>%
  #Get incidence and prevalence
  mutate(incidence = news/neglag,
         prevalence = pos/totalID) %>%
  #Stay only with the time, incidence and prevalence columns
  select(time, incidence, prevalence)
#> # A tibble: 4 x 3
#>    time incidence prevalence
#>   <int>     <dbl>      <dbl>
#> 1     1     0            0  
#> 2     2     0.2          0.2
#> 3     3     0.25         0.4
#> 4     4     0.333        0.6


Does this make sense? Is there another way? Many thanks.
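One other possibility, sketched here under assumptions, is to use the `MI2` column from the question instead of lagging `MI`. This assumes that `MI2` is `NA` only for time points after a person's first event (no other missing data), so the non-missing rows at each time are exactly the people who were still disease-free when that period started. It also assumes nobody recovers, so prevalence is the running total of first events. The names `res_mi2`, `new_cases`, `at_risk` and `total` are illustrative only.

```r
library(dplyr)

# Example data from the question, keeping only the MI2 coding
# (1 at the first event, NA afterwards).
dsn2 <- data.frame(
  id   = rep(1:5, each = 4),
  time = rep(1:4, times = 5),
  MI2  = c(0,0,1,NA,  0,1,NA,NA,  0,0,0,1,  0,0,0,0,  0,0,0,0)
)

res_mi2 <- dsn2 %>%
  group_by(time) %>%
  summarise(new_cases = sum(MI2 == 1, na.rm = TRUE),  # first events at this time
            at_risk   = sum(!is.na(MI2)),             # not yet diseased beforehand
            total     = n_distinct(id)) %>%
  mutate(incidence  = new_cases / at_risk,
         prevalence = cumsum(new_cases) / total) %>%  # assumes no recovery
  select(time, incidence, prevalence)

res_mi2
```

On the example data this should give the same incidence (0, 0.2, 0.25, 0.333) and prevalence (0, 0.2, 0.4, 0.6) as above, without the lag step. It would not be appropriate if `MI2` can be missing for other reasons, such as loss to follow-up.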

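Thanks @H1. I edited the example for clarity. Any suggestions on how to do this for each time point?

Thanks @www. I edited and clarified the example (see above). Any suggestions on how to do this at each time point?

Thanks @Jonathan. I edited and clarified the example (see above). Any suggestions on how to do this at each time point?

This does not return the correct result if the disease occurs at time 1. I have posted a solution.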