Conditional mutate based on a column of repeated data, and converting from long to wide format with dplyr
I am trying to reorganize some data from long format to wide format. There are many people (MRN), each of whom was sequenced a different number of times (seq_date), and I want to create a data frame that shows how Val changes over time. The data frame I am starting with looks like this:
dat_data <- data.frame(
  MRN = c("012345", "012345", "012345", "012345", "012345", "012345"),
  seq_date = c("1-Aug-18", "27-Mar-19", "27-Mar-19", "27-Mar-19", "7-May-19", "7-May-19"),
  Gene = c("SRSF2", "TET2", "IDH1", "SRSF2", "IDH1", "SRSF2"),
  AA = c("p.A2B", "p.C2D", "p.E2F", "p.A2B", "p.E2F", "p.A2B"),
  Val = c("0.1", "0.2", "0.3", "0.4", "0.5", "0.6")
)
> dat_data
MRN seq_date Gene AA Val
1 012345 1-Aug-18 SRSF2 p.A2B 0.1
2 012345 27-Mar-19 TET2 p.C2D 0.2
3 012345 27-Mar-19 IDH1 p.E2F 0.3
4 012345 27-Mar-19 SRSF2 p.A2B 0.4
5 012345 7-May-19 IDH1 p.E2F 0.5
6 012345 7-May-19 SRSF2 p.A2B 0.6
I then want to use gather/spread to create a wide-format data frame that looks like this:
MRN Gene AA D1 D2 D3
1 012345 SRSF2 p.A2B 0.1 0.4 0.6
2 012345 IDH1 p.E2F 0 0.3 0.5
3 012345 TET2 p.C2D 0 0.2 0
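I am most familiar with dplyr. I tried mutate = case_ for the first step and gather/spread for the second, but without success. Any help is much appreciated.

I grouped by date to determine the order, then used pivot_wider from tidyr to do the spread: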
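library(dplyr)
library(tidyr)

dat_data %>%
  mutate(
    sdate = lubridate::dmy(seq_date),     # parse to Date, in case dates aren't in order
    Val = as.numeric(as.character(Val))   # convert factor/character to numeric
  ) %>%
  group_by(sdate) %>%
  mutate(
    # Creates D1, D2, etc. cur_group_id() replaces the argument-free
    # group_indices(), which was deprecated in dplyr 1.0.0
    ord_date = paste0('D', cur_group_id())
  ) %>%
  ungroup() %>%                           # drop grouping before reshaping
  pivot_wider(
    id_cols = c(MRN, Gene, AA),
    names_from = ord_date,
    values_from = Val,
    values_fill = list(Val = 0)           # fills missings with 0 instead of NA
  )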
# A tibble: 3 x 6
MRN Gene AA D1 D2 D3
<fct> <fct> <fct> <dbl> <dbl> <dbl>
1 012345 SRSF2 p.A2B 0.1 0.4 0.6
2 012345 TET2 p.C2D 0 0.2 0
3 012345 IDH1 p.E2F 0 0.3 0.5
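
The question mentions many MRNs, each with its own set of sequencing dates. Grouping by date alone numbers the dates across all patients, so a minimal sketch of one possible variation is to rank the dates within each MRN using dplyr's dense_rank(). This is only an illustration under assumptions: it reuses the example data from the question (a single MRN, so the result matches the table above), the name `wide` is made up here, and it assumes an English locale so that lubridate::dmy() can parse month abbreviations such as "Aug".

```r
library(dplyr)
library(tidyr)

# Example data copied from the question (one MRN only)
dat_data <- data.frame(
  MRN = c("012345", "012345", "012345", "012345", "012345", "012345"),
  seq_date = c("1-Aug-18", "27-Mar-19", "27-Mar-19", "27-Mar-19", "7-May-19", "7-May-19"),
  Gene = c("SRSF2", "TET2", "IDH1", "SRSF2", "IDH1", "SRSF2"),
  AA = c("p.A2B", "p.C2D", "p.E2F", "p.A2B", "p.E2F", "p.A2B"),
  Val = c("0.1", "0.2", "0.3", "0.4", "0.5", "0.6"),
  stringsAsFactors = FALSE
)

wide <- dat_data %>%
  mutate(
    sdate = lubridate::dmy(seq_date),  # parse to Date so ranking is chronological
    Val = as.numeric(Val)
  ) %>%
  group_by(MRN) %>%                    # assumption: D1, D2, ... restart for each patient
  mutate(ord_date = paste0("D", dense_rank(sdate))) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = c(MRN, Gene, AA),
    names_from = ord_date,
    values_from = Val,
    values_fill = list(Val = 0)        # 0 instead of NA where a gene was not seen
  )

print(wide)
```

With more than one MRN, each patient's earliest date would become D1 regardless of the calendar dates of other patients; whether that is the desired behaviour depends on how the time points are meant to be compared.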