R spread vs dcast
I have a table like this:
> head(dt2)
Weight Height Fitted interval limit value
1 65.6 174.0 71.91200 pred lwr 53.73165
2 80.7 193.5 91.63237 pred lwr 73.33198
3 72.6 186.5 84.55326 pred lwr 66.31751
4 78.8 187.2 85.26117 pred lwr 67.02004
5 74.8 181.5 79.49675 pred lwr 61.29244
6 86.4 184.0 82.02501 pred lwr 63.80652
I want it to look like this:
> head(reshape2::dcast(dt2,
                       Weight + Height + Fitted + interval ~ limit,
                       fun.aggregate = mean))
Weight Height Fitted interval lwr upr
1 42.0 153.4 51.07920 conf 49.15463 53.00376
2 42.0 153.4 51.07920 pred 32.82122 69.33717
3 43.2 160.0 57.75378 conf 56.35240 59.15516
4 43.2 160.0 57.75378 pred 39.54352 75.96404
5 44.8 149.5 47.13512 conf 44.87642 49.39382
6 44.8 149.5 47.13512 pred 28.83891 65.43133
but using tidyr::spread. How can I do that?
I used:
> tidyr::spread(dt2, limit, value)
but got the error:
Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)
A random sample of 10 rows:
> dt[sample(nrow(dt), 10), ]
Weight Height Fitted interval limit value
1253 52.2 162.5 60.28203 conf upr 61.51087
426 49.1 158.8 56.54022 pred upr 74.75756
1117 78.4 184.5 82.53066 conf lwr 80.98778
1171 85.9 166.4 64.22611 conf lwr 63.21254
948 61.4 177.8 75.75494 conf lwr 74.66393
384 90.9 172.7 70.59731 pred lwr 52.41828
289 75.9 172.7 70.59731 pred lwr 52.41828
3 44.8 149.5 47.13512 pred lwr 28.83891
774 87.3 182.9 80.91258 pred upr 99.12445
772 86.4 175.3 73.22669 pred upr 91.40919
Let's assume you are starting with data that looks like this:
mydf
# Weight Height Fitted interval limit value
# 1 42 153.4 51.0792 conf lwr 49.15463
# 2 42 153.4 51.0792 pred lwr 32.82122
# 3 42 153.4 51.0792 conf upr 53.00376
# 4 42 153.4 51.0792 pred upr 69.33717
# 5 42 153.4 51.0792 conf lwr 60.00000
# 6 42 153.4 51.0792 pred lwr 90.00000
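Notice that rows 5 and 6 repeat a combination of the grouping columns (columns 1 to 5) that has already appeared. That is essentially what "tidyr" is telling you: rows 1 and 5 are duplicates of each other, as are rows 2 and 6.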
tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)
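Before reshaping, it can help to list the offending rows yourself. The following is a minimal base R sketch, assuming the six-row example above; the names id_cols, is_dup and dup_rows are made up for illustration.

```r
# Stand-in for the example data shown above (same six rows).
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), times = 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Every column except `value` acts as an identifier when spreading.
id_cols <- setdiff(names(mydf), "value")

# Flag rows whose identifier combination occurs more than once.
# Scanning from both ends marks every member of a duplicated group.
is_dup <- duplicated(mydf[id_cols]) |
  duplicated(mydf[id_cols], fromLast = TRUE)

dup_rows <- which(is_dup)  # rows 1, 2, 5 and 6 for this example
mydf[is_dup, ]
```

These are the same row pairs that the error message reports, so the check tells you which values will need to be aggregated.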
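As @Jaap suggested, the solution is to "summarise" the data first. Since "tidyr" only reshapes data (unlike "reshape2", which both aggregates and reshapes), you need to perform the aggregation with "dplyr" before changing the shape of the data. Here I have done that by calling summarise on the "value" column.
If you stop the pipeline at the summarise step, you will see that the original 6-row dataset has "shrunk" to 4 rows. Now spread works as expected: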
library(dplyr)
library(tidyr)

mydf %>%
  group_by(Weight, Height, Fitted, interval, limit) %>%
  summarise(value = mean(value)) %>%   # collapse duplicates to their mean
  spread(limit, value)
# Source: local data frame [2 x 6]
#
# Weight Height Fitted interval lwr upr
# (dbl) (dbl) (dbl) (chr) (dbl) (dbl)
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
This matches the output you would expect from dcast with fun.aggregate = mean:
reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
# Weight Height Fitted interval lwr upr
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
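As a side note, newer tidyr releases provide pivot_wider, which can aggregate duplicates while reshaping through its values_fn argument. The sketch below assumes a tidyr version that has pivot_wider and accepts a bare function for values_fn (to the best of my knowledge, 1.1.0 or later); it is an alternative to the summarise-then-spread approach, not part of the original answer.

```r
library(tidyr)

# Stand-in for the example data shown above (same six rows).
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), times = 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Columns not named in names_from / values_from are used as identifiers;
# values_fn = mean collapses duplicated identifier combinations.
wide <- pivot_wider(
  mydf,
  names_from  = limit,
  values_from = value,
  values_fn   = mean
)
wide
```

With this example data the result should have one row per interval, with lwr and upr columns holding the per-group means, as in the dcast output above.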
Sample data:
mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4,
153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,
51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",
"conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,
2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),
value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,
90)), .Names = c("Weight", "Height", "Fitted", "interval",
"limit", "value"), row.names = c(NA, 6L), class = "data.frame")
Here is a data.table alternative to dplyr, using mydf from Ananda's answer:
library(data.table)
library(magrittr)
library(tidyr)
DT <- data.table(mydf)
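One way to get the wide result directly is to compute the mean for each limit inside j. The code below is only a sketch of what such a hard-coded step might look like, under the assumption that limit takes exactly the two values "lwr" and "upr"; it is not necessarily the code the answer originally used.

```r
library(data.table)

# Stand-in for DT built from the example data (same six rows).
DT <- data.table(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), times = 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# One output column per limit level, each level spelled out by hand.
res <- DT[, .(lwr = mean(value[limit == "lwr"]),
              upr = mean(value[limit == "upr"])),
          by = .(Weight, Height, Fitted, interval)]
res
```

The drawback is that every level of limit has to be written out explicitly, so the code needs editing whenever a new level appears.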
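If this limit==… looks too hard-coded, you can aggregate in long format first and then spread. This works because once the data is aggregated, there are no duplicates left: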
DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
spread(key = "limit", value = "value")
Both give you:
# Weight Height Fitted interval lwr upr
#1: 42 153.4 51.0792 conf 54.57731 53.00376
#2: 42 153.4 51.0792 pred 61.41061 69.33717
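Your example contains no upr in limit and no conf in interval, which means your expected result is not reproducible.
Why not leave it in long format and just aggregate? See the base R, dplyr and data.table approaches.
I have already done this with dcast, but I want to do it with tidyr just to learn. @mtoto This is only part of my dataset; I will edit the question to give you a random sample to make it reproducible.
This should work: dt2 %>% group_by(interval, limit) %>% summarise_each(funs(mean)) %>% spread(limit, value, -c(1:3))
Summarising by interval and limit gave me only two rows.
Thanks, but I was actually asking about dplyr and tidyr. I have already solved this with reshape2, but I wanted to know how to do it with these specific packages. Thank you anyway!
Thanks! I was wondering how to handle the aggregation function. I guess Hadley intends tidyr to be used together with dplyr.
This is a great answer that helped me understand the difference between dcast and spread. Thank you very much.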