R 将多组测量列(宽格式)重塑为单列(长格式)
我有一个宽格式的数据框,在不同的日期范围内重复测量。在我的示例中,有三个不同的时段,都有相应的值。例如,第一次测量(R 将多组测量列(宽格式)重塑为单列(长格式),r,reshape,tidyr,reshape2,r-faq,R,Reshape,Tidyr,Reshape2,R Faq,我有一个宽格式的数据框,在不同的日期范围内重复测量。在我的示例中,有三个不同的时段,都有相应的值。例如,第一次测量(Value1)是在DateRange1Start到DateRange1End期间测量的: ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3 1 1/1/90 3/1/90 4.4 4/5/91 6/7/91
Value1
)是在DateRange1Start
到DateRange1End
期间测量的:
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
我希望将数据重塑为长格式,以便将DateRangeXStart和DateRangeXEnd列分组,。因此,原始表中的1行变为新表中的3行:
ID DateRangeStart DateRangeEnd Value
1 1/1/90 3/1/90 4.4
1 4/5/91 6/7/91 6.2
1 5/5/95 6/6/96 3.3
我知道一定有办法用
重塑2
/熔化
/重铸
/tidyr
,但我似乎不知道如何以这种特殊的方式将多组度量变量映射到单组值列;基本R
功能即可
a <- read.table(textConnection("
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
"),header=TRUE)
b1 <- a[,c(1:4)]; b2 <- a[,c(1,5:7)]; b3 <- a[,c(1,8:10)]
colnames(b1) <- colnames(b2) <- colnames(b3) <- c("ID","DateRangeStart","DateRangeEnd","Value")
b <- rbind(b1,b2,b3)
a
(根据Josh的建议添加了v.names。)以下是使用tidyr
解决问题的方法。这是其函数extract\u numeric()
的一个有趣的用例,我用它从列名中提取组
library(dplyr)
library(tidyr)
a <- read.table(textConnection("
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
"),header=TRUE)
a %>%
gather(variable,value,-ID) %>%
mutate(group = extract_numeric(variable)) %>%
mutate(variable = gsub("\\d","",x = variable)) %>%
spread(variable,value)
ID group DateRangeEnd DateRangeStart Value
1 1 1 3/1/90 1/1/90 4.4
2 1 2 6/7/91 4/5/91 6.2
3 1 3 6/6/96 5/5/95 3.3
库(dplyr)
图书馆(tidyr)
a%
聚集(变量,值,-ID)%%>%
变异(组=提取\数值(变量))%>%
突变(变量=gsub(“\\d”,“x=variable))%>%
排列(变量、值)
ID组DateRangeEnd DateRangeStart值
1 1 1 3/1/90 1/1/90 4.4
2 1 2 6/7/91 4/5/91 6.2
3 1 3 6/6/96 5/5/95 3.3
数据。表
的
熔化函数可以熔化为多列。利用这一点,我们可以简单地做到:
require(data.table)
melt(setDT(dat), id=1L,
measure=patterns("Start$", "End$", "^Value"),
value.name=c("DateRangeStart", "DateRangeEnd", "Value"))
# ID variable DateRangeStart DateRangeEnd Value
# 1: 1 1 1/1/90 3/1/90 4.4
# 2: 1 2 4/5/91 6/7/91 6.2
# 3: 1 3 5/5/95 6/6/96 3.3
或者,也可以通过列位置引用三组度量列:
melt(setDT(dat), id = 1L,
measure = list(c(2,5,8), c(3,6,9), c(4,7,10)),
value.name = c("DateRangeStart", "DateRangeEnd", "Value"))
另外两个选项(使用一个具有多行的示例数据帧,以更好地显示代码的工作情况):
1)带基数R:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
2)使用tidyverse
:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
3)使用sjmisc
-包:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
如果您还需要“组/时间”列,可以将上述方法调整为: 1)带基数R:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
2)使用tidyverse
:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
3)使用sjmisc
-包:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
已用数据:
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)
library(dplyr)
library(purrr)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group',
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))[,-2]
l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
group = rep(seq_along(l), each = nrow(d)),
do.call(rbind, l), row.names = NULL)
split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>%
bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)
library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
values = c('DateRangeStart','DateRangeEnd','Value'),
c('DateRange1Start','DateRange2Start','DateRange3Start'),
c('DateRange1End','DateRange2End','DateRange3End'),
c('Value1','Value2','Value3'))
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)
d使用回收:
data.frame(ID = d[, 1],
DateRangeStart = unlist(d[, -1][, c(TRUE, FALSE, FALSE)]),
DateRangeEnd = unlist(d[, -1][, c(FALSE, TRUE, FALSE)]),
Value = unlist(d[, -1][, c(FALSE, FALSE, TRUE)]))
从1.0.0版起,使用tidyr软件包的函数pivot\u longer()
可以将具有多个值/度量值列的宽格式重塑为长格式
这比之前的tidyr策略gather()
优于spread()
(请参见@AndrewMacDonald的答案),因为属性不再被删除(在下面的示例中,日期仍然是日期,数字仍然是数字)
library(“tidyr”)
图书馆(“magrittr”)
[4]“值\u 1”“日期范围开始\u 2”“日期范围结束\u 2”
#>[7]“值\u 2”“日期范围开始\u 3”“日期范围结束\u 3”
#>[10]“价值3”
枢轴_更长(a,
cols=-ID,
名称_to=c(“.value”,“group”),
#名称\u prefix=“DateRange”,
名称\u sep=“\u”)
#>#tibble:3 x 5
#>ID组DateRangeEnd DateRangeStart值
#>
#> 1 1 1 1990-01-03 1990-01-01 4.4
#> 2 1 2 1991-07-06 1991-05-04 6.2
#> 3 1 3 1996-06-06 1995-05-05 3.3
或者,可以使用提供更精细控制的轴心规范进行重塑(请参见下面的链接):
spec%
构建\u更长\u规范(cols=-ID)%>%
dplyr::transmute(.name=.name,
group=readr::parse_编号(名称),
.value=stringr::str|u extract(名称,“开始|结束|值”))
枢轴长度(a,等级=等级)
由(v0.2.1)于2019-03-26创建
另请参见:+1以展示variable=
参数的强大功能。接下来,v.names
参数还可以修饰这些列名,如下所示:v.names=c(“DateRangeStart”、“DateRangeEnd”、“Value”)
一般来说,您可能希望将来有一个更好的命名模式。例如,使用“DateRangeStart1”、“DateRangeEnd1”、“Value1”(换言之,“VariableMeasurement”)比将测量值固定在变量名称的某个位置更容易/更清晰。答案必须使用重塑2/melt/recast/tidyr
?(这个问题是一个更好、更一般的dupe目标,如果不是的话)这实际上是对一个稍微不同的问题的回答,即如何使用整洁的方法避免属性丢失。最初被接受的答案(使用stats::restrape
)从未出现过这个问题。而最初的问题显然也没有日期分类变量。重塑函数保留了因子级别和日期类。我完全同意您的stats::restrape()
solution(+1)同样出色。正则表达式可以简化为名称(a)