Pivoting rows into columns with a count value for each measurement in R
Tags: r, data.table, dplyr, plyr, reshape2

I have a sample data frame that I am working with:
ID <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
TARG_AVG <- c(2.1,2.1,2.1,2.1,2.1,2.1,2.3,2.3,2.5,2.5,2.5,2.5,3.1,3.1,3.1,3.1,3.3,3.3,3.3,3.3,3.5,3.5)
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep"
,"Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df1 <- data.frame(ID,TARG_AVG,Measurement)
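2) Once I have the above output, I would like to subset the rows to get a filtered output that returns only the rows in which at least 2 measurement counts are >= 2. My desired output here is: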
ID TARG_AVG Len Wid Ht Dep Brt Measurement.Count
1 A 2.1 3 1 2 0 0 6
3 A 2.5 0 0 2 2 0 4
4 B 3.1 2 0 0 2 0 4
5 B 3.3 0 0 2 0 2 4
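As a rough illustration of the pivot-to-counts step and the filter described in 2), here is a minimal sketch in base R. It is not taken from any of the answers; the helper names `key`, `groups`, `wide`, `measure_cols` and `filtered` are made up for this example, and the threshold `>= 2` is an assumption read off the desired output above.

```r
# Sample data from the question
ID <- rep(c("A", "B"), times = c(12, 10))
TARG_AVG <- rep(c(2.1, 2.3, 2.5, 3.1, 3.3, 3.5), times = c(6, 2, 4, 4, 4, 2))
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep",
                 "Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df1 <- data.frame(ID, TARG_AVG, Measurement)

# Step 1: count each Measurement per (ID, TARG_AVG) combination
key    <- paste(df1$ID, df1$TARG_AVG)                    # hypothetical helper key
counts <- as.data.frame.matrix(table(key, df1$Measurement))
groups <- unique(df1[c("ID", "TARG_AVG")])
wide   <- cbind(groups, counts[paste(groups$ID, groups$TARG_AVG), ])
rownames(wide) <- NULL

# Put the measurement columns in the order used in the question
measure_cols <- c("Len", "Wid", "Ht", "Dep", "Brt")
wide <- wide[c("ID", "TARG_AVG", measure_cols)]
wide$Measurement.Count <- rowSums(wide[measure_cols])

# Step 2: keep rows where at least 2 measurements have a count >= 2 (assumed threshold)
filtered <- wide[rowSums(wide[measure_cols] >= 2) >= 2, ]
filtered
```

The filter works on the individual count columns rather than on `Measurement.Count`, so a group with a single large count is not kept.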
3) Finally, I would like to turn these columns back into rows that contain only the measurements. My desired output is as follows:
ID TARG_AVG Measurement
1 A 2.1 Len
2 A 2.1 Len
3 A 2.1 Len
4 A 2.1 Ht
5 A 2.1 Ht
6 A 2.5 Ht
7 A 2.5 Ht
8 A 2.5 Dep
9 A 2.5 Dep
10 B 3.1 Len
11 B 3.1 Len
12 B 3.1 Dep
13 B 3.1 Dep
14 B 3.3 Ht
15 B 3.3 Ht
16 B 3.3 Brt
17 B 3.3 Brt
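One way to picture step 3 is to expand each surviving count back into that many rows. The sketch below uses base R only and starts from the filtered table shown in step 2, typed in by hand; the names `filtered`, `long` and `result` are illustrative, and dropping counts below 2 is an assumption based on the desired output (the single `Wid` row for A / 2.1 does not appear in it).

```r
# Filtered wide table from step 2, entered by hand for this example
filtered <- data.frame(
  ID       = c("A", "A", "B", "B"),
  TARG_AVG = c(2.1, 2.5, 3.1, 3.3),
  Len = c(3, 0, 2, 0),
  Wid = c(1, 0, 0, 0),
  Ht  = c(2, 2, 0, 2),
  Dep = c(0, 2, 2, 0),
  Brt = c(0, 0, 0, 2)
)
measure_cols <- c("Len", "Wid", "Ht", "Dep", "Brt")

# Long format: one row per (group, measurement) pair together with its count
long <- data.frame(
  ID          = rep(filtered$ID, times = length(measure_cols)),
  TARG_AVG    = rep(filtered$TARG_AVG, times = length(measure_cols)),
  Measurement = rep(measure_cols, each = nrow(filtered)),
  count       = unlist(filtered[measure_cols], use.names = FALSE)
)

# Keep only measurements counted at least twice (assumed rule),
# then repeat each remaining row 'count' times
long   <- long[long$count >= 2, ]
result <- long[rep(seq_len(nrow(long)), times = long$count),
               c("ID", "TARG_AVG", "Measurement")]
result <- result[order(result$ID, result$TARG_AVG), ]
rownames(result) <- NULL
result
```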
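I am currently learning the reshape2, dplyr and data.table packages, so it would be very helpful if someone could help me solve this and point me in the right direction.

In this case you don't need tidyr. You only need dplyr: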
library(dplyr)

df2 <- df1 %>%
group_by(ID, TARG_AVG) %>% # Group by ID and TARG_AVG
mutate(count=n()) %>% # Count how many rows there are for each combination of ID and TARG_AVG
filter(count > 2) %>% # Only keep the ones with more than 2 (I think you meant > 2)
select(-count) # Remove the auxiliary variable count
df2
Steps 1 and 2. Summarise, count, filter by number of measurements, and spread
library(dplyr)
library(tidyr)

# df0 is assumed to be the original data frame from the question
df1 <- df0 %>%
group_by(ID, TARG_AVG, Measurement) %>%
summarise(count=n()) %>%
group_by(ID, TARG_AVG) %>% # Step "2"
filter(n() >= 2) %>% # Step "2"
spread(Measurement, count, fill = 0) %>% # Resume step "1"
mutate(Measurement.count = Len + Wid + Ht + Dep + Brt)
df1
Step 3. Reshape
df3 <- df2 %>%
select(-Measurement.count) %>%
gather(Measurement, dummy, Brt:Wid) %>%
select(-dummy)
df3
Latest solution
library(data.table) #v 1.9.6+
setDT(df1)[, indx := .N, by = names(df1)
][indx > 1, if(uniqueN(Measurement) > 1) .SD, by = .(ID, TARG_AVG)]
# ID TARG_AVG Measurement indx
# 1: A 2.1 Len 3
# 2: A 2.1 Len 3
# 3: A 2.1 Len 3
# 4: A 2.1 Ht 2
# 5: A 2.1 Ht 2
# 6: A 2.5 Ht 2
# 7: A 2.5 Ht 2
# 8: A 2.5 Dep 2
# 9: A 2.5 Dep 2
# 10: B 3.1 Dep 2
# 11: B 3.1 Dep 2
# 12: B 3.1 Len 2
# 13: B 3.1 Len 2
# 14: B 3.3 Ht 2
# 15: B 3.3 Ht 2
# 16: B 3.3 Brt 2
# 17: B 3.3 Brt 2
Or the equivalent dplyr
df1 %>%
group_by(ID, TARG_AVG, Measurement) %>%
filter(n() > 1) %>%
group_by(ID, TARG_AVG) %>%
filter(n_distinct(Measurement) > 1)
Old solution
library(data.table)
## dcast the data (no need in total)
res <- dcast(df1, ID + TARG_AVG ~ Measurement)
## filter by at least 2 incidents of at least length 2
res <- res[rowSums(res[-(1:2)] > 1) > 1,]
## melt the data back and filter again by at least 2 incidents
res <- melt(setDT(res), id = 1:2)[value > 1]
## Expand the data back
res[, .SD[rep(.I, value)]]
Step 2
res <- res[res$"(all)" > 2,]
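Here is a data.table solution that may be a bit faster. I have found that subsetting in j with by can be somewhat slower than splitting the task into two steps: [1] add extra columns that you can filter on (the by is performed here), and [2] do the filtering in one go (without by):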
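This directly solves the third question. If you want the intermediate steps, you will have to stick with what you have, or replicate it with dplyr and tidyr ^^ But someone has already done that. If you really want a version using dplyr and tidyr, I can post it. reshape and reshape2 are a thing of the past!

Felipe. Can you post the solution for dplyr and tidyr? Please make sure the filter you apply is based on the measurement values, not on the total. I want the filter to return rows that have at least 2 measurements >= 2, not rows whose total is > 2. Please check.

Then I suggest doing the filtering in step 1. The problem is that you have to explicitly write out the variables whose measurement values you want to check, instead of simply grouping differently.

My pleasure! Hadley's new packages are really fast, aren't they ^^

Thanks for the solution, but one thing I would like to point out is that I am really not filtering on the total (> 2). I really want to filter the dataset based on the measurement values (i.e. I only want to include a row if 2 of its measurements are above 2. For example, if the total is 5 and the combination of measurements is 3+1+1, then I do not want to include that row, because only one measurement is above 2). Can you check? My df1 example above may not have been the best one, because the filter could be applied to the total itself rather than to the measurement values. I apologize for providing such an example.

@Sharath I have which(res[, c(3:7)] >= 2, arr.ind = TRUE) -> ind; res[unique(ind[, 1]), ] %>% arrange(ID, TARG_AVG) for the second part. I wonder if this is what you meant.

I added another solution, give it a try too.

Wow, David. It runs really well on my large dataset, and it is very fast too. This is some amazing stuff. Thank you very much for patiently helping me.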
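The difference between the two filters discussed in these comments can be checked on a tiny hand-made example. This is only a sketch in base R: the `counts` matrix and its row labels are invented for illustration, and ">= 2" is assumed as the per-measurement threshold.

```r
# Two hypothetical groups: counts per measurement
counts <- rbind(
  total_5_one_high = c(Len = 3, Wid = 1, Ht = 1, Dep = 0, Brt = 0),  # the 3+1+1 case
  total_4_two_high = c(Len = 0, Wid = 0, Ht = 2, Dep = 2, Brt = 0)
)

# Filter on the total: both rows pass
total_rule <- rowSums(counts) > 2

# Filter on the individual measurements: only rows with at least
# two measurements counted >= 2 pass
per_measurement_rule <- rowSums(counts >= 2) >= 2

total_rule
per_measurement_rule
```

The 3+1+1 row passes the total-based rule but fails the per-measurement rule, which is the behaviour asked for in the comment above.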
library(data.table)
setDT(df1)[, if(.N > 2) .SD, by = .(ID, TARG_AVG)]
> cTbl[, N := .N, .(ID, TARG_AVG, Measurement)
][N > 1, NMgt1 := uniqueN(Measurement) > 1, .(ID, TARG_AVG)
][N > 1 & NMgt1
][, c('N', 'NMgt1') := NULL
][]
ID TARG_AVG Measurement
1: A 2.1 Len
2: A 2.1 Len
3: A 2.1 Len
4: A 2.1 Ht
5: A 2.1 Ht
6: A 2.5 Ht
7: A 2.5 Ht
8: A 2.5 Dep
9: A 2.5 Dep
10: B 3.1 Dep
11: B 3.1 Dep
12: B 3.1 Len
13: B 3.1 Len
14: B 3.3 Ht
15: B 3.3 Ht
16: B 3.3 Brt
17: B 3.3 Brt