Aggregating a data.table using sum, length and grep
Let's create a data.table:
library(data.table)
dt <- data.table(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
x.1 x.2 x.3 vessel Year
1: 1 1 2 a 2012
2: 2 2 3 a 2013
3: 3 3 4 a 2014
4: 4 4 5 a 2015
5: 5 5 6 b 2012
6: 6 6 7 b 2013
7: 7 7 8 b 2014
8: 8 8 9 b 2015
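What I want is to aggregate by Year, summing the x columns and counting the vessels. Naming each column by hand works, but my real data has many columns, so I would like to use grep or %like% instead, and I cannot get that to work. My idea was along these lines: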
dt[, grep("x", colnames(dt)), with = FALSE]
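But how do I combine this with the aggregation?
You can use lapply to apply a function over all columns (.SD) or over several columns (selected with .SDcols):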
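A minimal sketch of that pattern, assuming the three x columns are listed explicitly by name (the result name `res_explicit` is only illustrative):

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Sum only the columns named in .SDcols, separately for each Year
res_explicit <- dt[, lapply(.SD, sum), by = Year,
                   .SDcols = c("x.1", "x.2", "x.3")]
res_explicit
```

Listing the names by hand is fine for three columns, but it does not scale to wide tables, which is where a pattern-based selection helps.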
The following also works, and selects every column whose name starts with "x":
dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)),
by=Year,
.SDcols=grepl("^x", names(dt))
]
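The same selection can probably be written in a few other ways. The sketch below assumes a data.table version that accepts `patterns()` in `.SDcols` (1.12.0 or later, as far as I know); `x_cols`, `x_cols_like`, `res_names` and `res_patt` are illustrative names, not part of the original answer.

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Option 1: build a character vector of matching column names
x_cols <- grep("^x", names(dt), value = TRUE)
# %like% is data.table's infix wrapper around grepl()
x_cols_like <- names(dt)[names(dt) %like% "^x"]

res_names <- dt[, c(lapply(.SD, sum), vessels = uniqueN(vessel)),
                by = Year, .SDcols = x_cols]

# Option 2: let .SDcols match the names itself (assumes patterns() support)
res_patt <- dt[, c(lapply(.SD, sum), vessels = uniqueN(vessel)),
               by = Year, .SDcols = patterns("^x")]
```

Both results should hold one row per Year with the summed x columns and the number of distinct vessels.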
I don't fully understand your question, but what you are trying to do with grep can be solved like this:
dt <- data.frame(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
dt[unlist(lapply(colnames(dt),function(v){grepl("x",v)}))]
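Since grepl() is vectorised over the strings it searches, the lapply() wrapper is arguably unnecessary. A shorter base R sketch follows, with `df`, `keep`, `x_only` and `agg` as illustrative names; note that the pattern "x" matches anywhere in a name, so "^x" would be the stricter choice.

```r
df <- data.frame(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# grepl() returns one TRUE/FALSE per column name
keep <- grepl("x", names(df))
x_only <- df[keep]

# Base R aggregation of the selected columns by Year
agg <- aggregate(x_only, by = list(Year = df$Year), FUN = sum)
agg
```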
If you really need it for efficiency:
> dt[, .SD
][, .N, .(vessel, Year)
][, .N, .(Year)
][, copy(dt)[.SD, vessels := i.N, on='Year']
][, vessel := NULL
][, melt(.SD, id.vars=c('Year', 'vessels'))
][, .(value=sum(value)), .(Year, vessels, variable)
][, dcast(.SD, ... ~ variable, value.var='value')
][, setcolorder(.SD, c(setdiff(colnames(.SD), 'vessels'), 'vessels'))
][order(Year)
]
Year x.1 x.2 x.3 vessels
1: 2012 6 6 8 2
2: 2013 8 8 10 2
3: 2014 10 10 12 2
4: 2015 12 12 14 2
If you have many columns to aggregate, it may be worth reshaping the data with melt() and aggregating with dcast().
Finally, the number of vessels has to be appended with a join.
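Tips
- The measure.vars parameter of melt() lets you define, select, or restrict the relevant measure columns.
- The subset parameter of dcast() lets you select or exclude specific measure variables.
- You can use multiple aggregation functions in dcast().
This allows for some fancy things like: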
dcast(molten, Year ~ variable, list(mean, sum, max), subset = .(variable == "x.2")
)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year value_mean_x.2 value_sum_x.2 value_max_x.2 vessels
#1: 2012 3 6 5 2
#2: 2013 4 8 6 2
#3: 2014 5 10 7 2
#4: 2015 6 12 8 2
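The measure.vars tip can be sketched in the same way. The example below assumes that only x.1 and x.2 are of interest; the regular expression and the names `x_cols`, `molten_x` and `wide` are illustrative choices.

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Restrict melting to x.1 and x.2; x.3 is dropped from the long table
x_cols <- grep("^x\\.[12]$", names(dt), value = TRUE)
molten_x <- melt(dt, id.vars = c("Year", "vessel"), measure.vars = x_cols)

# Aggregate with sum while casting back to wide format
wide <- dcast(molten_x, Year ~ variable,
              fun.aggregate = sum, value.var = "value")
wide
```

Restricting the measure columns at the melt() step keeps the long table smaller than filtering after the fact.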
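Comments:
I think .SDcols is definitely the key here, but I think dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)), by=Year, .SDcols=grepl("^x", names(dt))] might give the exact result the OP requested.
@thelatemail please post that as an answer and I will accept it :)
FYI, grep(patt, x, value=TRUE) should give the same answer as x[grep(patt, x)].
I have corrected the answer following thelatemail's suggestion. @thelatemail, feel free to post your own answer if you want the credit.
@Stefan - all good - I already have plenty of imaginary internet points :-)
To save typing, you can write by=Year instead of by=list(Year=dt$Year). This avoids the double reference to dt.
@UweBlock If more by variables are needed, by=list(Year=dt$Year) is easier to extend. But I will remove the dt$.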
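The grep() equivalence mentioned in the comments can be checked directly; `x` and `patt` below are just example values.

```r
x <- c("x.1", "x.2", "x.3", "vessel", "Year")
patt <- "^x"

# value = TRUE returns the matching strings themselves
by_value <- grep(patt, x, value = TRUE)
# the default returns integer positions, which then index x
by_index <- x[grep(patt, x)]

identical(by_value, by_index)
```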
Code for the melt()/dcast() answer above, step by step:
# Reshape from wide to long format
molten <- melt(dt, id.vars = c("Year", "vessel"))
molten
# Year vessel variable value
# 1: 2012 a x.1 1
# 2: 2013 a x.1 2
# 3: 2014 a x.1 3
# 4: 2015 a x.1 4
# 5: 2012 b x.1 5
# ...
#19: 2014 a x.3 4
#20: 2015 a x.3 5
#21: 2012 b x.3 6
#22: 2013 b x.3 7
#23: 2014 b x.3 8
#24: 2015 b x.3 9
# Year vessel variable value
# Aggregate with sum while reshaping back to wide format
dcast(molten, Year ~ variable, sum)
# Year x.1 x.2 x.3
#1: 2012 6 6 8
#2: 2013 8 8 10
#3: 2014 10 10 12
#4: 2015 12 12 14
# Count the distinct vessels per Year
dt[, .(vessels = uniqueN(vessel)), Year]
# Year vessels
#1: 2012 2
#2: 2013 2
#3: 2014 2
#4: 2015 2
# Join the two results on Year
dcast(molten, Year ~ variable, sum)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year x.1 x.2 x.3 vessels
#1: 2012 6 6 8 2
#2: 2013 8 8 10 2
#3: 2014 10 10 12 2
#4: 2015 12 12 14 2