R: convert from wide to long without sorting the columns
I want to convert a data frame from wide format to long format. Here is a toy example:
mydata <- data.frame(ID=1:5, ZA_1=1:5,
ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)
ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2
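I don't mind whether the idvars columns are grouped together at the start or stay in their original positions. For example,
ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1
would become
ID ZA TEMP BB CC
or the same with the id columns grouped at the start. I prefer the last option.
Another problem is that everything gets converted to character.
You can melt several columns at once if you pass a list of column names to the measure= argument of melt(). One way to do this in a scalable manner is to first extract the measure columns and their group names: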
measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
groups <- gsub("_[1-9]$","",measurevars)
Then create a list from measurevars with split(), and create a vector for the value.name= argument of melt().
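The steps described above can be sketched end to end as follows. This is a minimal sketch rather than the answerer's exact code: the toy data omits CC_7 so that every group has the same number of columns, the names measure_list and long are illustrative, and the list is passed unnamed so that the value columns are named only through value.name=.

```r
library(data.table)

# toy data from the question, without CC_7 (assumption: equal-sized groups)
mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5))

# columns to melt and the group each one belongs to
measurevars <- grep("_[1-9]$", names(mydata), value = TRUE)
groups <- sub("_[1-9]$", "", measurevars)

# factor with levels in order of first appearance, so split() does not
# reorder the groups alphabetically
split_on <- factor(groups, levels = unique(groups))
measure_list <- split(measurevars, split_on)

# melt all groups at once; each group becomes one value column
long <- melt(mydata,
             measure.vars = unname(measure_list),
             value.name = names(measure_list),
             variable.name = "measure")
long
```

Because each group is melted into its own value column, ZA stays integer and BB stays numeric; nothing is coerced to character. Note that the measure column holds the position within each group (1, 2), which matches the name suffix only when all groups share the same suffixes.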
Here is a method using the base R functions split.default and do.call:
# split the non-ID variables into groups based on their name suffix
myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))
# append variables by row after setting the regularizing variable names, cbind ID
cbind(mydata[1],
do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
ID ZA BB
1.1 1 1 3
1.2 2 2 3
1.3 3 3 3
1.4 4 4 3
1.5 5 5 3
2.1 1 5 6
2.2 2 4 6
2.3 3 3 6
2.4 4 2 6
2.5 5 1 6
I finally found a way by modifying my initial solution:
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,
BB_2001=rep(3,5),BB_2002=rep(6,5),CC_2007=6:2)
idvars = grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),
measure = sub('.*_', '', variable), variable = NULL)]
temp[,var:=factor(var, levels=unique(var))]
dcast( temp, ... ~ var, value.var='value' )
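The OP has updated his answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He asked for a direct way to do this in data.table, without creating huge intermediate results.
Well, data.table already has this ability; it is called a join.
Given the sample data from the question, the whole operation can be done with less memory by reshaping with only one id.var and then joining the reshaped result with the original data.table: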
setDT(mydata)
# add unique row number to join on later
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]
# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")
# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
# split column names of measure.vars
# Note that "variable" is reused to save memory
molten[, c("variable", "measure") := tstrsplit(variable, "_")]
# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]
# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]
# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
# ID rn measure ZA BB CC
# 1: 1 1 1 1 3 NA
# 2: 1 1 2 5 6 NA
# 3: 1 1 7 NA NA 6
# 4: 2 2 1 2 3 NA
# 5: 2 2 2 4 6 NA
# 6: 2 2 7 NA NA 5
# 7: 3 3 1 3 3 NA
# 8: 3 3 2 3 6 NA
# 9: 3 3 7 NA NA 4
#10: 4 4 1 4 3 NA
#11: 4 4 2 2 6 NA
#12: 4 4 7 NA NA 3
#13: 5 5 1 5 3 NA
#14: 5 5 2 1 6 NA
#15: 5 5 7 NA NA 2
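For readers who prefer not to load stringr and forcats, the same join-based steps can be sketched with data.table and base R only. This is a hedged variant, not the answerer's code: grep() and factor() stand in for str_subset() and fct_inorder(), the dcast formula is written out explicitly, and BB is created as integer (rep(3L, 5)) so that melt() does not need to coerce mixed column types.

```r
library(data.table)

# toy data; BB_* made integer here (assumption) to avoid type coercion in melt()
mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3L, 5), BB_2 = rep(6L, 5), CC_7 = 6:2)

# unique row number to join on later
mydata[, rn := seq_len(.N)]

# columns to be reshaped, found by pattern
measure_cols <- grep("_\\d$", names(mydata), value = TRUE)

# melt with rn as the only id column, so the other id columns are not replicated
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split "ZA_1" into group "ZA" and measure "1", reusing the variable column
molten[, c("variable", "measure") := tstrsplit(variable, "_", fixed = TRUE)]

# levels in order of first appearance keep the original column order
molten[, variable := factor(variable, levels = unique(variable))]

# drop the wide columns before joining to save memory
mydata[, (measure_cols) := NULL]

# cast back to one column per group and join the id columns on rn
result <- mydata[dcast(molten, rn + measure ~ variable, value.var = "value"),
                 on = "rn"]
result
```

The value columns come out in the order ZA, BB, CC, the order in which the groups first appear in the wide table.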
An alternative approach using data.table:
melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
][, variable := factor(variable, levels = unique(variable))
][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]
This gives the same table as the join approach, only without the helper column rn.
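As far as I can see, those are simple solutions for when you specify the variable names by hand. But I need an automated approach, using grep, because my full dataset has 3500 variables.
@Jaap This question is different, because I care about two things: I don't want the output to be reordered, and I need a memory-efficient solution, not just a simple one like in the other question you linked; my pattern is also more complex. Uwe's solution is a good one and was accepted a year ago. Nobody at your link provides a good solution, so please remove the "duplicate" tag from this question, because it is not a duplicate.
Thx for the explanation. I have reopened the question and posted an alternative solution.
My full dataset has 3500 variables; this is just a toy example. That is why I use a more complex approach, with grep, to find the variables to melt and cast automatically. Simply put, they all end in x, where x is a single digit.
How can I avoid everything being converted to character?
I have tried your code, and for measurenames I think it is better to use gsub("_[1-9]$", "", measurevars) instead of groups. I will try it.
To avoid extra packages, could I use grep("_\\d$", names(mydata2), value = T) and molten[, var := factor(variable, levels = unique(variable))] instead? I also tried patterns() before but ran into some problems.
Yes, of course, that should work as well. However, I do recommend these two packages, which come from Hadley Wickham's efforts to improve the usability of R. Note that it should be molten[, variable := factor(variable, levels = unique(variable))] rather than molten[, var := factor(variable, levels = unique(variable))], in order to save memory.
I think "remove columns no longer needed…" should come after the cast, or they could be removed as rows, couldn't they?
@skan This is just a warning about a potential inefficiency. data.table over-allocates the list vector of columns so that there is room to add further columns efficiently; see ?truelength. In an earlier processing step, some operation (the warning mentions a few) caused the over-allocation to be lost.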
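The over-allocation mentioned in the last comment can be inspected directly. The following is a small sketch on a throwaway table (dt, n_cols and n_slots are illustrative names): length() counts the columns in use, while truelength() reports how many column slots are allocated.

```r
library(data.table)

dt <- data.table(a = 1:3)

n_cols  <- length(dt)      # columns actually in use
n_slots <- truelength(dt)  # column slots allocated, including spare ones

# := adds the new column by reference, using one of the spare slots
dt[, b := a * 2L]

c(in_use = length(dt), allocated = truelength(dt))
```

As long as spare slots are available, adding columns with := does not require copying the table.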
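# uses measurevars and groups as defined earlier
# keep the groups in order of first appearance instead of alphabetical order
split_on <- factor(groups, levels = unique(groups))
# one list element per group of measure columns
measure_list <- split(measurevars, split_on)
measurenames <- unique(groups)
melt(setDT(mydata),
measure = measure_list,
value.name = measurenames,
variable.name = "measure")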
# ID measure ZA BB
# 1: 1 1 1 3
# 2: 2 1 2 3
# 3: 3 1 3 3
# 4: 4 1 4 3
# 5: 5 1 5 3
# 6: 1 2 5 6
# 7: 2 2 4 6
# 8: 3 2 3 6
# 9: 4 2 2 6
#10: 5 2 1 6
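# variant of the melt step in the join approach: select the measure columns with patterns() instead of stringr::str_subset()
molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))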
melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
][, variable := factor(variable, levels = unique(variable))
][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]
ID measure ZA BB CC
1: 1 1 1 3 NA
2: 1 2 5 6 NA
3: 1 7 NA NA 6
4: 2 1 2 3 NA
5: 2 2 4 6 NA
6: 2 7 NA NA 5
7: 3 1 3 3 NA
8: 3 2 3 6 NA
9: 3 7 NA NA 4
10: 4 1 4 3 NA
11: 4 2 2 6 NA
12: 4 7 NA NA 3
13: 5 1 5 3 NA
14: 5 2 1 6 NA
15: 5 7 NA NA 2