R: How to pivot and count a data.frame (e.g., a list of medical conditions and the number of patients with each)


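I'm trying to get better with dplyr and tidyr, but I'm not used to "thinking in R". An example is probably best. The table I generated from my SQL data looks like this: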

╔═══════════╦════════════╦═════╦════════╦══════════════╦══════════╦══════════════╗
║ patientid ║ had_stroke ║ age ║ gender ║ hypertension ║ diabetes ║ estrogen HRT ║
╠═══════════╬════════════╬═════╬════════╬══════════════╬══════════╬══════════════╣
║ 934988    ║ 1          ║ 65  ║ M      ║ 1            ║ 1        ║ 0            ║
║ 94044     ║ 0          ║ 69  ║ F      ║ 1            ║ 0        ║ 0            ║
║ 689348    ║ 0          ║ 56  ║ F      ║ 0            ║ 1        ║ 1            ║
║ 902498    ║ 1          ║ 45  ║ M      ║ 0            ║ 0        ║ 1            ║
║ …         ║            ║     ║        ║              ║          ║              ║
╚═══════════╩════════════╩═════╩════════╩══════════════╩══════════╩══════════════╝

I want to create an output table containing the following information:

╔══════════════╦════════╦══════════╦══════════╦══════════╦═══════════╗
║              ║ total  ║M lt50 yo ║F lt50 yo ║M gte50yo ║F gte 50yo ║
╠══════════════╬════════╬══════════╬══════════╬══════════╬═══════════╣
║ estrogen HRT ║ 347    ║ 2        ║ 65       ║ 4        ║ 97        ║
║ diabetes     ║ 13922  ║ 54       ║ 73       ║ 192      ║ 247       ║
║ hypertension ║ 8210   ║ 102      ║ 187      ║ 443      ║ 574       ║
╚══════════════╩════════╩══════════╩══════════╩══════════╩═══════════╝

Total is the total number of patients with that comorbidity (easy: sum(data$estrogen == 1), etc.). The other cells are the number of patients with that comorbidity in that age and gender stratum, where had_stroke = 1.

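I'd like a general idea of how to approach a problem like this, since it seems to be a basic way of transforming data. If the total column makes things awkward, feel free to leave it out.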

Trying to do it a bit more simply:

I assume you have a data.frame named data. Here is a toy data set:

set.seed(0)
data <- data.frame(estrogen = runif(100) < .10,
               diabetes = runif(100) < .15,
               hypertension = runif(100) < .20,
               groups = cut(runif(100), c(0,.1,.4,.7,1), labels = c("my", "fy", "mo", "fo")))
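The step that builds res is not shown in this answer. As a minimal sketch, assuming res is meant to hold one row per condition, with the overall count in the first column followed by the count in each level of groups, it could be built roughly like this. This is a reconstruction under that assumption and not necessarily the original code; the conditions vector is a name introduced here for illustration.

```r
# Toy data as above, repeated so this block runs on its own
set.seed(0)
data <- data.frame(estrogen = runif(100) < .10,
                   diabetes = runif(100) < .15,
                   hypertension = runif(100) < .20,
                   groups = cut(runif(100), c(0, .1, .4, .7, 1),
                                labels = c("my", "fy", "mo", "fo")))

# Hypothetical helper vector: the condition columns to summarise
conditions <- c("estrogen", "diabetes", "hypertension")

# For each condition: the overall count, then the count per level of groups.
# table() on a factor keeps every level, so an empty group yields 0, not NA.
# vapply returns a 5 x 3 matrix (one column per condition); t() flips it
# so that each condition becomes a row.
res <- t(vapply(data[conditions],
                function(x) c(sum(x), table(data$groups[x])),
                numeric(5)))
res
```

The first column comes out unnamed, which is why the naming step that follows sets it explicitly.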

Finally, use colnames(res) and rownames(res) to set appropriate names for the columns and rows:

colnames(res)[1] <- "Total"
rownames(res) <- c("estrogen", "diabetes", "hypertension")

Here is a data.table solution:

# create MRE - you have this already
n  <- 1000
set.seed(1)     # for reproducible example
df <- data.frame(ID=sample(1:n,n),had_stroke=sample(0:1,n,replace=TRUE),
                age=sample(25:85,n,replace=TRUE), gender=sample(c("M","F"),n,replace=TRUE),
                hypertension=sample(0:1,n,replace=TRUE),
                diabetes=sample(0:1,n,replace=TRUE),
                estrogen=sample(0:1,n,replace=TRUE))

# you start here.
library(data.table)
result <- melt(setDT(df),measure=5:7, variable.name="comorbidity")
result[,list(total=sum(value==1), 
             M.lt.50=sum(value[gender=="M"&age< 50]),
             F.lt.50=sum(value[gender=="F"&age< 50]),
             M.ge.50=sum(value[gender=="M"&age>=50]),
             F.ge.50=sum(value[gender=="F"&age>=50])),
       by=comorbidity]

#     comorbidity total M.lt.50 F.lt.50 M.ge.50 F.ge.50
# 1: hypertension   521     104     126     143     148
# 2:     diabetes   482     109     120     125     128
# 3:     estrogen   492      99     126     119     148
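Since the question asks about dplyr and tidyr specifically, the same reshape-then-summarise idea can be sketched with those packages. This is a rough sketch, not part of either answer: the small data frame is made up for illustration, the stratum labels (M_lt50 and so on) are names chosen here, and it assumes tidyr 1.0 or later and dplyr 1.0 or later. Unlike the data.table code above, it applies the had_stroke == 1 restriction to the stratum columns, as the question describes.

```r
library(dplyr)
library(tidyr)

# Small hypothetical data set with the question's column layout
df <- data.frame(
  patientid    = 1:8,
  had_stroke   = c(1, 0, 1, 1, 0, 1, 1, 0),
  age          = c(65, 69, 56, 45, 30, 72, 41, 58),
  gender       = c("M", "F", "F", "M", "F", "M", "F", "M"),
  hypertension = c(1, 1, 0, 0, 1, 1, 1, 0),
  diabetes     = c(1, 0, 1, 0, 0, 1, 0, 1),
  estrogen     = c(0, 0, 1, 1, 0, 0, 1, 0)
)

# Long format: one row per patient and comorbidity, plus a stratum label
long <- df %>%
  pivot_longer(c(hypertension, diabetes, estrogen),
               names_to = "comorbidity", values_to = "value") %>%
  mutate(stratum = paste(gender, ifelse(age < 50, "lt50", "ge50"), sep = "_"))

# Total: all patients with the comorbidity, regardless of stroke status
totals <- long %>%
  group_by(comorbidity) %>%
  summarise(total = sum(value), .groups = "drop")

# Strata: only patients with had_stroke == 1, spread to one column per stratum
strata <- long %>%
  filter(had_stroke == 1) %>%
  group_by(comorbidity, stratum) %>%
  summarise(n = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = stratum, values_from = n, values_fill = 0)

result <- left_join(totals, strata, by = "comorbidity")
result
```

The general pattern is the same in all three approaches: bring the condition columns into a single long column, then count within whatever grouping defines the output columns.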

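The named result (res) from the toy data set approach is a table of this form:

             Total my fy mo fo
estrogen        12  2  2  4  4
diabetes        28  1  8 11  8
hypertension    27  1 10 11  5

Your expected output doesn't seem to match the data provided, so it isn't entirely clear what you want to achieve.

Thank you so much for the help! I guess I was thinking too much in terms of dplyr and tidyr.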
