R: Collapse multiple columns containing the same variable into one column
My data looks like this:
ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0
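Diagnosis_1 through Diagnosis_4 are all binary, indicating the presence (1) or absence (0) of a diagnosis. What I would like to do is create a data frame that looks like this: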
ID Diagnosis
A 1
A 1
A 1
B 2
C 4
C 2
D 4
E 3
E 2
E 3
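No matter how many times I read the documentation for reshape/reshape2/tidyr, I cannot work out how to apply them.
I could solve my problem with dplyr's mutate, but that is a time-consuming, roundabout way to reach my goal.
Edit: edited the data to represent my actual data frame more faithfully.

You can try max.col to get the column index of each row: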
data.frame(ID=df1$ID, Diagnosis=max.col(df1[-1]))
# ID Diagnosis
#1 A 1
#2 B 2
#3 C 2
#4 D 4
#5 E 3
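One detail worth knowing: max.col breaks ties at random by default, and a row with no diagnosis at all is one big tie. The following is a minimal sketch, not part of the original answer, that applies the same idea to the updated data from the question. The name mydf and the NA handling for all-zero rows are assumptions made for illustration.

```r
# Updated sample data from the question (IDs repeat across rows).
mydf <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0", header = TRUE)

m <- as.matrix(mydf[-1])

# ties.method = "first" makes the result deterministic if a row ever
# contains more than one 1 (the default, "random", picks one at random).
idx <- max.col(m, ties.method = "first")

# Assumption: a row with no 1 at all should give NA rather than column 1.
idx[rowSums(m) == 0] <- NA

result <- data.frame(ID = mydf$ID, Diagnosis = idx)
print(result)
```

Because max.col works row by row, repeated IDs need no extra handling.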
Alternatively, another option to get the index is
unname(which(t(df1[-1])!=0, arr.ind=TRUE)[,1])
#[1] 1 2 2 4 3
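The transpose in that one-liner matters because which(..., arr.ind = TRUE) reports hits column by column. The short sketch below, which rebuilds df1 from the sample data as an assumption, compares the two orderings:

```r
df1 <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0", header = TRUE)

# Without transposing, hits are listed diagnosis column by diagnosis
# column, so the result is not in the order of the original rows.
by_col <- which(df1[-1] != 0, arr.ind = TRUE)
untransposed <- unname(by_col[, "col"])
print(untransposed)

# After transposing, each column of the matrix is one original row, so
# the column-by-column scan follows the original row order. The "row"
# index of the transposed matrix is the diagnosis number.
by_row <- which(t(df1[-1]) != 0, arr.ind = TRUE)
transposed <- unname(by_row[, "row"])
print(transposed)
```

Here the untransposed version returns the diagnoses for rows 1, 2, 3, 5, 4 in that order, whereas the transposed version lines up with rows 1 through 5.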
Benchmarks
Using microbenchmark:
library(microbenchmark)
microbenchmark(akrun(), Grothendieck(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun() 1.019108 1.252443 1.084306 1.180743 1.16463 0.6928535 20 a
#Grothendieck() 1.000000 1.000000 1.000000 1.000000 1.00000 1.0000000 20 a
Data
df1

Try matrix multiplication:
nc <- ncol(DF)
data.frame(ID = DF$ID, Diagnosis = as.matrix(DF[-1]) %*% seq(nc-1))
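The multiplication works because each row holds a single 1: multiplying the row by the vector 1, 2, ..., k leaves only the position of that 1. Here is a small sketch of the same idea with an explicit check of that assumption. The check and the variable names m and diagnosis are additions for illustration, not part of the answer above.

```r
DF <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0", header = TRUE)

m <- as.matrix(DF[-1])

# The trick is only valid when every row has exactly one 1. A row with
# several 1s would give the sum of their positions, and an all-zero row
# would give 0.
stopifnot(all(rowSums(m) == 1))

# Each row times c(1, 2, 3, 4) equals the position of its single 1.
diagnosis <- as.vector(m %*% seq_len(ncol(m)))

result <- data.frame(ID = DF$ID, Diagnosis = diagnosis)
print(result)
```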
Note: we used this as input:
Lines <- "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0"
DF <- read.table(text = Lines, header = TRUE)
Since you mentioned "reshape2", "tidyr", and related tools, here are some options to consider:
## Using "tidyr" and "dplyr"
library(dplyr)
library(tidyr)
df1 %>%
gather(var, val, -ID) %>%
separate(var, into = c("var", "value")) %>%
filter(val == 1) %>%
select(ID, value)
# ID value
# 1 A 1
# 2 B 2
# 3 C 2
# 4 E 3
# 5 D 4
## Getting half-way there with "melt" from "reshape2"
library(reshape2)
melt(replace(df1, df1 == 0, NA), id.vars = "ID", na.rm = TRUE)
# ID variable value
# 1 A Diagnosis_1 1
# 7 B Diagnosis_2 1
# 8 C Diagnosis_2 1
# 15 E Diagnosis_3 1
# 19 D Diagnosis_4 1
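To go the rest of the way from the melt output, the diagnosis number can be pulled out of the variable column. This is a sketch, assuming reshape2 is installed and the column names follow the Diagnosis_<number> pattern. df1 is rebuilt here from the sample data, and the names long and result are made up for the example.

```r
library(reshape2)

df1 <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0", header = TRUE)

# Same first step as above: turn 0 into NA so melt can drop those rows.
long <- melt(replace(df1, df1 == 0, NA), id.vars = "ID", na.rm = TRUE)

# Strip the "Diagnosis_" prefix and convert what is left to an integer.
long$Diagnosis <- as.integer(sub("^Diagnosis_", "", long$variable))

# melt orders by variable, so sort by ID to restore the original order.
result <- long[order(long$ID), c("ID", "Diagnosis")]
print(result)
```

Sorting by ID is enough here because each ID appears once. With repeated IDs a helper row counter would be needed to keep the rows in their original order.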
Given your update, you just need to add a secondary ID:
library(dplyr)
library(tidyr)
mydf %>%
group_by(ID) %>%
mutate(ID2 = row_number()) %>%
gather(var, val, Diagnosis_1:Diagnosis_4) %>%
separate(var, into = c("var", "value")) %>%
filter(val == 1) %>%
arrange(ID, ID2)
# Source: local data frame [10 x 5]
#
# ID ID2 var value val
# 1 A 1 Diagnosis 1 1
# 2 A 2 Diagnosis 1 1
# 3 A 3 Diagnosis 1 1
# 4 B 1 Diagnosis 2 1
# 5 C 1 Diagnosis 4 1
# 6 C 2 Diagnosis 2 1
# 7 D 1 Diagnosis 4 1
# 8 E 1 Diagnosis 3 1
# 9 E 2 Diagnosis 2 1
# 10 E 3 Diagnosis 3 1
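What do "var" and "val" refer to?
@MakairaMurakami, those are the two columns you are creating when you gather the columns.
Thank you for the explanation. Unfortunately, when I run the set of tidyr/dplyr commands, I get the following error: "Error: Values not split into 2 pieces at", which goes on to list a long string of numbers.
@MakairaMurakami, do you have more columns, or differently named columns, than in the example you shared?
Yes, I have 5 diagnosis columns named "Diagnosis_1" through "Diagnosis_4", then "Diagnosis_10", as well as the ID column. Why would that make a difference, given the code you provided?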
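One possible way to sidestep the separate() step is to strip the common prefix while reshaping, so no column name has to be split. This is a minimal sketch, assuming tidyr 1.0.0 or later (where pivot_longer supersedes gather) and that every diagnosis column starts with Diagnosis_. The data frame mydf recreates the updated sample, and the all-zero Diagnosis_10 column is a made-up addition to mimic the setup described in the comments. Whether this resolves the reported error cannot be confirmed from the thread.

```r
library(dplyr)
library(tidyr)

# Updated sample data plus a hypothetical Diagnosis_10 column.
mydf <- read.table(text = "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4 Diagnosis_10
A 1 0 0 0 0
A 1 0 0 0 0
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 0 1 0
C 0 1 0 0 0
D 0 0 0 1 0
E 0 0 1 0 0
E 0 1 0 0 0
E 0 0 1 0 0", header = TRUE)

result <- mydf %>%
  # names_prefix removes "Diagnosis_" and leaves only the number.
  pivot_longer(starts_with("Diagnosis_"),
               names_to = "Diagnosis",
               names_prefix = "Diagnosis_",
               values_to = "val") %>%
  # Keep only the diagnosis that is present in each original row.
  filter(val == 1) %>%
  mutate(Diagnosis = as.integer(Diagnosis)) %>%
  select(ID, Diagnosis)

print(result)
```

pivot_longer emits the values of one input row before moving to the next, so after filtering the rows stay in their original order and no secondary ID is needed.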