R: collapse multiple columns containing the same variable into one column


My data looks like this:

ID   Diagnosis_1   Diagnosis_2   Diagnosis_3   Diagnosis_4
A        1             0             0             0
A        1             0             0             0
A        1             0             0             0
B        0             1             0             0
C        0             0             0             1
C        0             1             0             0
D        0             0             0             1
E        0             0             1             0
E        0             1             0             0
E        0             0             1             0
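Diagnosis_1:Diagnosis_4 are all binary, indicating the presence (1) or absence (0) of that diagnosis. What I would like to do is create a data frame that looks like this: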

ID   Diagnosis
A        1
A        1
A        1
B        2
C        4
C        2
D        4
E        3
E        2
E        3
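No matter how many times I read the documentation for reshape/reshape2/tidyr, I cannot get my head around how to apply them.

I can solve my problem with dplyr's mutate, but that is a time-intensive, roundabout way to reach my goal.


EDIT: Edited the data to represent my actual data frame more accurately.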

You can try max.col to get the column index for each row:

 data.frame(ID=df1$ID, Diagnosis=max.col(df1[-1]))
 #    ID Diagnosis
 #1  A         1
 #2  B         2
 #3  C         2
 #4  D         4
 #5  E         3
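Because max.col works row by row, the same call should also cover the updated data with repeated IDs. The following is a minimal sketch under that assumption: the name mydf is borrowed from a later answer, the data are the ten rows from the question, and ties.method = "first" is an addition here so that the result is deterministic (the default breaks ties at random, which only matters if a row has more than one 1).

```r
# The ten-row data from the question (IDs repeat).
mydf <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0", header = TRUE)

# For each row, the position of the largest value among the diagnosis columns.
# ties.method = "first" is an assumption added here to avoid random tie-breaking.
res <- data.frame(ID = mydf$ID,
                  Diagnosis = max.col(mydf[-1], ties.method = "first"))
print(res)
```

No grouping or helper ID is involved, since each output row depends only on the matching input row.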
Or, another option to get the index is:

 unname(which(t(df1[-1])!=0, arr.ind=TRUE)[,1])
 #[1] 1 2 2 4 3
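The transpose in that line is there because which() scans a matrix column by column. A small sketch of the same idea without transposing, where the hits are reordered by row instead (df1 is assumed here to be the five-row example; idx is a name introduced for illustration):

```r
# Assumed input: the five-row example, one row per ID.
df1 <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0", header = TRUE)

# Logical matrix marking where a diagnosis is present.
m <- as.matrix(df1[-1]) != 0

# which(arr.ind = TRUE) returns (row, col) pairs in column-major order,
# so sort them by row to line them up with the original rows.
idx <- which(m, arr.ind = TRUE)
diagnosis <- unname(idx[order(idx[, "row"]), "col"])
print(diagnosis)
```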
Benchmarks

Using microbenchmark:

library(microbenchmark)
microbenchmark(akrun(), Grothendieck(), unit='relative', times=20L)
#Unit: relative
#         expr      min       lq     mean   median      uq       max neval cld
#       akrun() 1.019108 1.252443 1.084306 1.180743 1.16463 0.6928535    20  a
#Grothendieck() 1.000000 1.000000 1.000000 1.000000 1.00000 1.0000000    20  a
Data

df1 is the five-row example, one row per ID (the same data as DF below).

Try matrix multiplication:

nc <- ncol(DF)
data.frame(ID = DF$ID, Diagnosis = as.matrix(DF[-1]) %*% seq(nc-1))
Note: We used this as input:

Lines <- "ID   Diagnosis_1   Diagnosis_2   Diagnosis_3   Diagnosis_4
A        1             0             0             0
B        0             1             0             0
C        0             1             0             0
D        0             0             0             1
E        0             0             1             0"

DF <- read.table(text = Lines, header = TRUE)
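The matrix product works because each row holds a single 1: multiplying the row by the weights 1, 2, 3, 4 picks out the position of that 1. Here is a sketch that makes the assumption explicit and returns a plain vector rather than a one-column matrix (the names m and w, the drop() call and the stopifnot() check are additions for illustration):

```r
DF <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0", header = TRUE)

m <- as.matrix(DF[-1])   # 5 x 4 indicator matrix
w <- seq_len(ncol(m))    # weights 1, 2, 3, 4

# Assumption: exactly one diagnosis per row. With several 1s in a row the
# product would be the sum of their positions, not a single index.
stopifnot(all(rowSums(m) == 1))

# drop() turns the 5 x 1 result of %*% into an ordinary vector.
res <- data.frame(ID = DF$ID, Diagnosis = drop(m %*% w))
print(res)
```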
Since you mentioned "reshape2", "tidyr", and related tools, here are some options to consider:

## Using "tidyr" and "dplyr"
library(dplyr)
library(tidyr)

df1 %>%
  gather(var, val, -ID) %>%
  separate(var, into = c("var", "value")) %>%
  filter(val == 1) %>%
  select(ID, value)
#   ID value
# 1  A     1
# 2  B     2
# 3  C     2
# 4  E     3
# 5  D     4

## Getting half-way there with "melt" from "reshape2"
library(reshape2)
melt(replace(df1, df1 == 0, NA), id.vars = "ID", na.rm = TRUE)
#    ID    variable value
# 1   A Diagnosis_1     1
# 7   B Diagnosis_2     1
# 8   C Diagnosis_2     1
# 15  E Diagnosis_3     1
# 19  D Diagnosis_4     1

Given your update, you just need to add a secondary ID:

library(dplyr)
library(tidyr)

mydf %>%
  group_by(ID) %>%
  mutate(ID2 = row_number()) %>%
  gather(var, val, Diagnosis_1:Diagnosis_4) %>%
  separate(var, into = c("var", "value")) %>%
  filter(val == 1) %>%
  arrange(ID, ID2)
# Source: local data frame [10 x 5]
# 
#    ID ID2       var value val
# 1   A   1 Diagnosis     1   1
# 2   A   2 Diagnosis     1   1
# 3   A   3 Diagnosis     1   1
# 4   B   1 Diagnosis     2   1
# 5   C   1 Diagnosis     4   1
# 6   C   2 Diagnosis     2   1
# 7   D   1 Diagnosis     4   1
# 8   E   1 Diagnosis     3   1
# 9   E   2 Diagnosis     2   1
# 10  E   3 Diagnosis     3   1
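In current tidyr, gather() is superseded by pivot_longer(). A rough equivalent of the pipeline above might look like the sketch below, assuming tidyr 1.1.0 or later for names_transform; the column name present is chosen here for illustration. Since pivot_longer() emits the long rows in input-row order, the helper ID2 should not be needed in this version.

```r
library(dplyr)
library(tidyr)

# The ten-row data from the question.
mydf <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0", header = TRUE)

res <- mydf %>%
  pivot_longer(-ID,
               names_to = "Diagnosis",          # column names become values
               names_prefix = "Diagnosis_",     # strip the common prefix
               names_transform = list(Diagnosis = as.integer),
               values_to = "present") %>%       # the 0/1 indicator
  filter(present == 1) %>%                      # keep only the diagnosis that is set
  select(ID, Diagnosis)
print(res)
```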

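What do "var" and "val" refer to?

@MakairaMurakami, those are the two columns you create when gathering the columns.

Thanks for the explanation. Unfortunately, when I run the set of tidyr/dplyr commands, I get the following error: "Error: Values not split into 2 pieces at", which goes on to list a long string of numbers.

@MakairaMurakami, do you have more columns, or differently named columns, compared with the example you shared?

Yes, I have 5 diagnosis columns named "Diagnosis_1" through "Diagnosis_4", and then "Diagnosis_10", plus the ID column. Why would that make a difference, given the code you provided?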
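The thread does not resolve that error. One thing a column such as Diagnosis_10 does change is that a column's position no longer equals the code in its name, so position-based answers would report 5 rather than 10. A base R sketch that reads the code from the column name instead, using hypothetical data (dat, diag_cols, codes and pos are illustrative names, and this is not a diagnosis of the reported separate() error):

```r
# Hypothetical data: the fifth diagnosis column is named Diagnosis_10.
dat <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4 Diagnosis_10
A 1 0 0 0 0
B 0 0 0 0 1
C 0 1 0 0 0", header = TRUE)

# Select the diagnosis columns by name and pull the numeric suffix from each.
diag_cols <- grep("^Diagnosis_", names(dat), value = TRUE)
codes <- as.integer(sub("^Diagnosis_", "", diag_cols))

# Position of the 1 in each row, then translate position into the named code.
pos <- max.col(dat[diag_cols], ties.method = "first")
res <- data.frame(ID = dat$ID, Diagnosis = codes[pos])
print(res)
```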