R data.table: grouping by a column that contains lists

I am trying to use the group-by functionality of the data.table package in R.

library(data.table)

start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))

        date value group
1 2014-01-01     1     a
2 2014-01-02     2     a
3 2014-01-03     3     b
4 2014-01-04     4     b
5 2014-01-05     5     a
6 2014-01-06     6     b

a[, mean(value), by = group]
#    group     V1
# 1:     a 2.6667
# 2:     b 4.3333

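Is this also possible with the data.table package when the group column is a list, i.e. when a row can belong to several groups?

Update

My first question is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the group), I need the data.table back in its "original" form, with unique rows based on date. My solution so far: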

a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]

        date value group
1 2014-01-01     1     a
2 2014-01-02     2     a
3 2014-01-02     2     b
4 2014-01-03     3     b
5 2014-01-04     4     b
6 2014-01-05     5     a
7 2014-01-06     6     b

# creates new column with mean based on group
b[,factor := mean(value), by=group] 

#creates new data.table c without duplicate rows (based on date) + if a row has group a & b it creates the product of their factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]

date      value  group        factor
01/01/14  1      a            2.666666667
02/01/14  2      c("a", "b")  10
03/01/14  3      b            3.75
04/01/14  4      b            3.75
05/01/14  5      a            2.666666667
06/01/14  6      b            3.75
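The steps above can be condensed into one self-contained sketch. It assumes data.table is installed. The expansion step uses by = .(date, value) with unlist(group), which is a swapped-in alternative to the rep()/lengths() indexing shown above, and the names long, grp_mean and res are illustrative only.

```r
library(data.table)

a <- data.table(
  date  = seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days"),
  value = c(1, 2, 3, 4, 5, 6),
  group = list("a", c("a", "b"), "b", "b", "a", "b")
)

# Expand the list column: one row per (date, group) pair.
# (Alternative to a[rep(1:nrow(a), lengths(group))]; 'long' is an illustrative name.)
long <- a[, .(group = unlist(group)), by = .(date, value)]

# Per-group mean of 'value', repeated on every row of that group.
long[, grp_mean := mean(value), by = group]

# Collapse back to one row per date; rows that belong to several groups
# get the product of the group means.
res <- long[, .(value  = unique(value),
                group  = list(group),
                factor = prod(grp_mean)),
            by = date]
res
```

Keeping the per-group mean in its own column (grp_mean) before collapsing makes it easy to inspect the intermediate long table when a factor looks wrong.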

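One option is to group by row sequence: we unlist the list column ('group'), paste the list elements together (toString(...)), use cSplit from splitstackshape with direction='long' to reshape the result into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable.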
library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
#  grp       V1
#1:   a 2.666667
#2:   b 3.750000
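The separator ', ' passed to cSplit matches what toString() produces. A minimal base-R sketch of that round trip follows; strsplit() is used here only as a stand-in to illustrate the splitting step, it is not what cSplit calls internally, and the names grp and parts are illustrative.

```r
# A small list column like the one in the question.
group <- list("a", c("a", "b"), "b")

# toString() collapses each list element into one string,
# joining multiple elements with a comma followed by a space.
grp <- vapply(group, toString, character(1))
grp

# Splitting on the same ", " separator recovers the individual labels.
# (strsplit is an illustrative stand-in for the split done by cSplit.)
parts <- strsplit(grp, ", ", fixed = TRUE)
parts
```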

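I just realized that another option using splitstackshape is listCol_l, which unlists a list column into long format. Since the output is a data.table, we can use data.table methods to compute the mean. This is much more compact.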
 listCol_l(a, 'group')[, mean(value), group_ul]
 #  group_ul       V1
 #1:        a 2.666667
 #2:        b 3.750000


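Another option, without splitstackshape, is to replicate the rows of the dataset by the length of each list element. lengths is a convenient wrapper for sapply(group, length) and is faster. We then replace the 'group' column by unlisting the original 'group' from the 'a' dataset, and get the mean of 'value', grouped by 'group'.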
 a[rep(1:nrow(a), lengths(group))][,
        group:=unlist(a$group)][, mean(value), by = group]
 #  group       V1
 #1:     a 2.666667
 #2:     b 3.750000
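To make the row replication concrete, here is a base-R sketch of the index that rep() builds from lengths(). It uses plain vectors instead of a data.table, and the names n_per_row, idx and m are illustrative.

```r
value <- c(1, 2, 3, 4, 5, 6)
group <- list("a", c("a", "b"), "b", "b", "a", "b")

# Number of groups attached to each row.
n_per_row <- lengths(group)

# Row i is repeated n_per_row[i] times, so row 2 appears twice.
idx <- rep(seq_along(group), n_per_row)
idx

# unlist(group) has one label per replicated row, so the two line up.
m <- tapply(value[idx], unlist(group), mean)
m
```

Because idx and unlist(group) are built from the same list in the same order, every replicated row is paired with exactly one of its group labels.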

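Great! As far as I understand, cSplit splits a row that has two groups into two identical rows (one for the first group, one for the second), and then we can simply use normal data.table functions to calculate the mean(). Very nice solution @akrun. I didn't know anything about the splitstackshape package...

@RandomDude Yes, it splits rows that contain multiple elements into separate rows. Thanks for the feedback.

Where can I find the lengths function? I can't find it via Google, and it is not available in my version of RStudio...

@RandomDude It is a base R function. I think it was introduced in R 3.2.0. If you have an earlier version, you can replace it with sapply(group, length).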
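The fallback from the comment can be checked with a short base-R sketch. The vapply() variant is an addition not mentioned in the thread; it does the same job while enforcing an integer result. The names n_fallback and n_vapply are illustrative.

```r
group <- list("a", c("a", "b"), "b")

# Fallback for R versions that do not provide lengths().
n_fallback <- sapply(group, length)

# Same idea with vapply(), which guarantees one integer per element.
n_vapply <- vapply(group, length, integer(1))

n_fallback
n_vapply
```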
# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date 
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
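As a possible alternative to the final merge(), data.table can add the column by reference with an update join. This is a sketch under the assumption that b holds exactly one row per date; the small tables here are a cut-down stand-in for the real a and b, using the first three factor values from the result shown earlier.

```r
library(data.table)

# Cut-down stand-ins for 'a' and the per-date summary 'b'.
a <- data.table(date = as.Date("2014-01-01") + 0:2, value = c(1, 2, 3))
b <- data.table(date = as.Date("2014-01-01") + 0:2, factor = c(8/3, 10, 3.75))

# Update join: look up each date of 'a' in 'b' and copy b's factor into 'a'.
# i.factor refers to the 'factor' column of the table used in i (here 'b').
a[b, on = "date", factor := i.factor]
a
```

Unlike merge(), this modifies a in place instead of creating a new table, which can matter for large inputs.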
