R: converting from wide to long without sorting the columns


I want to convert a data frame from wide format to long format.

This is a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2
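
I don't mind whether the id.vars columns are gathered together at the beginning or stay in their original positions. For example,

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

would become

ID ZA TEMP BB CC

I prefer the last option.

Another problem is that everything gets converted to character.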

If you pass a list of column names to the measure= argument, you can melt several columns at once. One way to do this in a scalable manner is:

  • Extract the column names and the corresponding first two letters:

    measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
    groups <- gsub("_[1-9]$","",measurevars)
    
  • Use measurevars with split() to create a list, and create a vector for the value.name= argument of melt().
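
The point of the split step is that the groups must not come back sorted. A minimal sketch in base R, using the column names of the toy example. The name split_on is borrowed from the split(measurevars, split_on) call that appears in the code for this approach; building it as a factor with levels in order of first appearance is an assumption, since the text does not show how it is defined:

```r
# Column names of the measure variables in the question's toy example
measurevars <- c("ZA_1", "ZA_2", "BB_1", "BB_2", "CC_7")
groups <- gsub("_[1-9]$", "", measurevars)

# Splitting on a character vector orders the groups alphabetically
alphabetical <- split(measurevars, groups)
names(alphabetical)

# Splitting on a factor whose levels follow first appearance keeps the
# original column order (assumed definition of split_on)
split_on <- factor(groups, levels = unique(groups))
in_order <- split(measurevars, split_on)
names(in_order)
```

The second list can then be passed to measure=, with unique(groups) supplying the value names in the same order.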


Here is a method using the base R functions split.default and do.call:

    # split the non-ID variables into groups based on their name suffix
    myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))
    
    # append variables by row after setting the regularizing variable names, cbind ID
    cbind(mydata[1],
          do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
        ID ZA BB
    1.1  1  1  3
    1.2  2  2  3
    1.3  3  3  3
    1.4  4  4  3
    1.5  5  5  3
    2.1  1  5  6
    2.2  2  4  6
    2.3  3  3  6
    2.4  4  2  6
    2.5  5  1  6
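
The posted output contains only ZA and BB, so the sketch below assumes the toy data without CC_7, i.e. every suffix has the same set of prefixes. It repeats the approach above and additionally stores the suffix in a column instead of leaving it in the row names (the column name measure and the helper names suffix, stacked and long are chosen here for illustration):

```r
# Toy data from the question, restricted to the ZA and BB columns (assumption)
mydata <- data.frame(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5))

# group the non-ID columns by their numeric suffix
suffix <- gsub(".*_(\\d)$", "\\1", names(mydata[-1]))
myList <- split.default(mydata[-1], suffix)

# strip the suffix from the names, then stack the groups by row
stacked <- do.call(rbind, lapply(myList, function(x) {
  setNames(x, gsub("_\\d$", "", names(x)))
}))

# ID is recycled across the stacked groups
long <- cbind(mydata[1], stacked)

# record which suffix each block of rows came from
long$measure <- rep(names(myList), each = nrow(mydata))
long
```

With CC_7 included, the suffix groups no longer share the same columns, and rbind() would be expected to stop with an error, so this approach seems to need groups of identical shape.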
    
I finally found a way by modifying my initial solution:

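    library(data.table)

    mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                         BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

    # id columns are all columns that do not end in a four-digit year
    idvars <- grep("_20[0-9][0-9]$", names(mydata), invert = TRUE)
    temp <- melt(mydata, id.vars = idvars)
    # split the old column name into prefix (var) and year (measure)
    temp[, `:=`(var = sub("_20[0-9][0-9]$", "", variable),
                measure = sub(".*_", "", variable),
                variable = NULL)]
    # factor levels in order of appearance keep dcast() from sorting the columns
    temp[, var := factor(var, levels = unique(var))]
    dcast(temp, ... ~ var, value.var = "value")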
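
Why the factor() line matters can be seen by casting once before and once after it. This is a sketch, not the OP's exact code: BB is written as integer here so that all measure columns share one type, tstrsplit() is used for the name split, and the names sorted and kept are made up for the comparison:

```r
library(data.table)

mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3L, 5), BB_2002 = rep(6L, 5), CC_2007 = 6:2)

temp <- melt(mydata, id.vars = "ID")
# split "ZA_2001" into prefix "ZA" and year "2001"
temp[, c("var", "measure") := tstrsplit(variable, "_", fixed = TRUE)]
temp[, variable := NULL]

# var is character here: dcast() orders the new columns alphabetically
sorted <- dcast(temp, ID + measure ~ var, value.var = "value")
names(sorted)

# var as a factor with levels in order of appearance: dcast() follows the levels
temp[, var := factor(var, levels = unique(var))]
kept <- dcast(temp, ID + measure ~ var, value.var = "value")
names(kept)
```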
    

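The OP has updated the answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He suggests that data.table needs a direct way to do this without creating huge intermediate results.

Well, data.table already has that ability: it is called a join.

Given the sample data from the question, the whole operation can be done with less memory by reshaping with only one id.var and then joining the reshaped result with the original data.table: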

    library(data.table)  # stringr and forcats are called via :: below and must be installed
    setDT(mydata)
    
    # add unique row number to join on later 
    # (leave `ID` col as placeholder for all other id.vars)
    mydata[, rn := seq_len(.N)]
    
    # define columns to be reshaped
    measure_cols <- stringr::str_subset(names(mydata), "_\\d$")
    
    # melt with only one id.vars column
    molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
    
    # split column names of measure.vars
    # Note that "variable" is reused to save memory 
    molten[, c("variable", "measure") := tstrsplit(variable, "_")]
    
    # coerce names to factors in the same order as the columns appeared in mydata
    molten[, variable := forcats::fct_inorder(variable)]
    
    # remove columns no longer needed in mydata _before_ joining to save memory
    mydata[, (measure_cols) := NULL]
    
    # final dcast and right join
    result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
    result
    #    ID rn measure ZA BB CC
    # 1:  1  1       1  1  3 NA
    # 2:  1  1       2  5  6 NA
    # 3:  1  1       7 NA NA  6
    # 4:  2  2       1  2  3 NA
    # 5:  2  2       2  4  6 NA
    # 6:  2  2       7 NA NA  5
    # 7:  3  3       1  3  3 NA
    # 8:  3  3       2  3  6 NA
    # 9:  3  3       7 NA NA  4
    #10:  4  4       1  4  3 NA
    #11:  4  4       2  2  6 NA
    #12:  4  4       7 NA NA  3
    #13:  5  5       1  5  3 NA
    #14:  5  5       2  1  6 NA
    #15:  5  5       7 NA NA  2
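
If the extra packages are not wanted, the same join approach can probably be written with data.table and base R only. This is a sketch under a few assumptions that are not in the answer: grep(value = TRUE) replaces stringr::str_subset(), factor(levels = unique()) replaces forcats::fct_inorder(), BB is created as integer so that all measure columns share one type, and type.convert = TRUE is passed to tstrsplit() so that measure comes back as integer rather than character:

```r
library(data.table)

mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3L, 5), BB_2 = rep(6L, 5), CC_7 = 6:2)

# unique row number to join on later
mydata[, rn := seq_len(.N)]

# columns to be reshaped, found by pattern
measure_cols <- grep("_\\d$", names(mydata), value = TRUE)

# melt with a single id column only
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split the names; type.convert = TRUE turns the suffix into an integer
molten[, c("variable", "measure") := tstrsplit(variable, "_", type.convert = TRUE)]

# keep the groups in their order of appearance
molten[, variable := factor(variable, levels = unique(variable))]

# drop the wide columns before joining
mydata[, (measure_cols) := NULL]

result <- mydata[dcast(molten, rn + measure ~ variable, value.var = "value"), on = "rn"]
result
```

Only the character suffix is addressed here; id columns keep whatever type they had, because they never pass through melt().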
    

An alternative approach using data.table:

    melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
                            ][, variable := factor(variable, levels = unique(variable))
                              ][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]
    
which gives one row per ID and measure, with the columns ZA, BB and CC kept in their original order.


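As far as I can see, those are simple solutions where you specify the variable names manually. But I need an automated way, using grep, because my whole dataset has 3500 variables.

@Jaap This question is different because I care about two things: I don't want the output to be reordered, and I need a memory-efficient solution, not just a simple one like in the other question you linked. My pattern is also more complex. Uwe's solution is a good one and was accepted a year ago. Nobody at your link provides a good solution, so please remove the "duplicate" tag from this question, because it is not a duplicate.

Thx for the explanation. I have reopened the question and posted an alternative solution.

My whole dataset has 3500 variables. This is just a toy example. That's why I use a more complex method to automatically find the variables to melt and cast, using grep. Simply put, they all end in x, where x is a single digit.

How can I avoid everything being converted to character?

I have tried your code, and for measurenames I think it is better to use gsub("[1-9]$", "", measurevars) instead of groups. I will try it.

To avoid using extra packages, could I use grep("\\d$", names(mydata2), value = T) and molten[, var := factor(variable, levels = unique(variable))] instead? I also tried patterns() before, but ran into some problems.

Yes, of course, that should work as well. However, I recommend the two packages; they are among those Hadley Wickham created to make R easier to use. Note that it should be molten[, variable := factor(variable, levels = unique(variable))] rather than molten[, var := factor(variable, levels = unique(variable))], in order to save memory.

I think "remove columns no longer needed..." should be placed after the cast, or they could be removed as rows, couldn't they?

@skan That is only a warning about a potential inefficiency. data.table over-allocates the list vector of columns so that there is room to add further columns efficiently, see ?truelength. In one of the preceding processing steps, some operation (the warning mentions some) caused the over-allocation to be lost.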
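
The over-allocation mentioned in the last comment can be inspected directly. A small sketch, assuming a freshly created data.table with default options; the table DT and the column total are made up for illustration, and the exact number of spare slots depends on the data.table settings:

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

length(DT)      # number of columns currently in use
truelength(DT)  # number of column slots allocated, including the spare ones

# adding a column by reference fills one of the spare slots
DT[, total := a + b]
length(DT)
truelength(DT)
```

As long as truelength() is larger than length(), := can add columns without reallocating the column list.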
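    # Final step of the melt(measure = ...) approach described above.
    # split_on is not defined anywhere in the text; it is assumed here to be the
    # group labels as a factor whose levels follow their order of first appearance.
    library(data.table)
    split_on <- factor(groups, levels = unique(groups))
    measure_list <- split(measurevars, split_on)
    measurenames <- unique(groups)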
    
    melt(setDT(mydata), 
         measure = measure_list, 
         value.name = measurenames,
         variable.name = "measure")
    #    ID measure ZA BB
    # 1:  1       1  1  3
    # 2:  2       1  2  3
    # 3:  3       1  3  3
    # 4:  4       1  4  3
    # 5:  5       1  5  3
    # 6:  1       2  5  6
    # 7:  2       2  4  6
    # 8:  3       2  3  6
    # 9:  4       2  2  6
    #10:  5       2  1  6
    
    # Variant of the melt() call in the join approach: select the measure columns with patterns()
    molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))
    
Output of the alternative data.table approach (the melt() / tstrsplit() / dcast() chain):
    
        ID measure ZA BB CC
     1:  1       1  1  3 NA
     2:  1       2  5  6 NA
     3:  1       7 NA NA  6
     4:  2       1  2  3 NA
     5:  2       2  4  6 NA
     6:  2       7 NA NA  5
     7:  3       1  3  3 NA
     8:  3       2  3  6 NA
     9:  3       7 NA NA  4
    10:  4       1  4  3 NA
    11:  4       2  2  6 NA
    12:  4       7 NA NA  3
    13:  5       1  5  3 NA
    14:  5       2  1  6 NA
    15:  5       7 NA NA  2