
Aggregating a data.table using sum, length and grep


Let's create a data.table:

library(data.table)

dt <- data.table(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
   x.1 x.2 x.3 vessel Year
1:   1   1   2      a 2012
2:   2   2   3      a 2013
3:   3   3   4      a 2014
4:   4   4   5      a 2015
5:   5   5   6      b 2012
6:   6   6   7      b 2013
7:   7   7   8      b 2014
8:   8   8   9      b 2015
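
I want to aggregate this by Year: sum each of the x columns and count the distinct vessels. Writing the columns out one by one gives me what I want, but my real data has many columns, so I would like to select them with grep or %like%, and I cannot get that to work. My idea is along these lines: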

dt[, grep("x", colnames(dt)), with = FALSE]

But how do I combine this with the aggregation?

You can use lapply to apply a function over all columns (.SD) or over a subset of columns (selected with .SDcols):
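As a minimal sketch of that idea, assuming the example data from the question and naming the three x columns explicitly (the names sum_cols and res are only illustrative and do not come from the original answer):

```r
library(data.table)

# Example data from the question
dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Assumption: these are the only columns that should be summed
sum_cols <- c("x.1", "x.2", "x.3")

# .SD holds just the columns listed in .SDcols;
# lapply() sums each of them within every Year group
res <- dt[, lapply(.SD, sum), by = Year, .SDcols = sum_cols]
print(res)
```

Naming the columns by hand does not scale to many columns, which is why the pattern-based selection is the more useful form.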

The following can also be used to select all columns whose name starts with "x":

dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)),
    by=Year,
    .SDcols=grepl("^x", names(dt))
]

I don't quite understand your question, but what you are trying to do with grep can be done like this:

dt <- data.frame(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
dt[unlist(lapply(colnames(dt),function(v){grepl("x",v)}))]

If you really need it for efficiency:

> dt[, .SD
     ][, .N, .(vessel, Year)
     ][, .N, .(Year)
     ][, copy(dt)[.SD, vessels := i.N, on='Year']
     ][, vessel := NULL
     ][, melt(.SD, id.vars=c('Year', 'vessels'))
     ][, .(value=sum(value)), .(Year, vessels, variable)
     ][, dcast(.SD, ... ~ variable, value.var='value')
     ][, setcolorder(.SD, c(setdiff(colnames(.SD), 'vessels'), 'vessels'))
     ][order(Year)
     ]

   Year x.1 x.2 x.3 vessels
1: 2012   6   6   8       2
2: 2013   8   8  10       2
3: 2014  10  10  12       2
4: 2015  12  12  14       2
> 

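If you have many columns to aggregate, it may be worthwhile to reshape the data with melt() and aggregate with dcast(). Finally, a join is needed to append the number of vessels (the melt()/dcast() code is listed at the end of this page).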

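Tips
  • The measure.vars parameter of melt() lets you define, select or restrict the relevant measure columns.
  • The subset parameter of dcast() lets you select or exclude specific measure variables.
  • You can use multiple aggregation functions in dcast().
This allows fancy things such as: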

dcast(molten, Year ~ variable, list(mean, sum, max), subset = .(variable == "x.2")
      )[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
#   Year value_mean_x.2 value_sum_x.2 value_max_x.2 vessels
#1: 2012              3             6             5       2
#2: 2013              4             8             6       2
#3: 2014              5            10             7       2
#4: 2015              6            12             8       2
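
To illustrate the measure.vars tip, here is a hedged sketch that picks the measure columns with grep before melting. It assumes the example data from the question; the names measure_cols, molten_x and sums are made up for this illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Select the measure columns by pattern; value = TRUE returns the names
measure_cols <- grep("^x", names(dt), value = TRUE)

# Reshape to long format, keeping only the selected measure columns
molten_x <- melt(dt, id.vars = c("Year", "vessel"),
                 measure.vars = measure_cols)

# Aggregate back to one row per Year, summing each measure column
sums <- dcast(molten_x, Year ~ variable, fun.aggregate = sum)
print(sums)
```

With many columns, only the pattern passed to grep has to change; the melt() and dcast() calls stay the same.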

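I think .SDcols is definitely the key here, but I think dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)), by=Year, .SDcols=grepl("^x", names(dt))] might give exactly the result the OP requested.

@thelatemail Please post that as an answer and I will accept it :)

FYI, grep(patt, x, value=TRUE) should give the same answer as x[grep(patt, x)].

I corrected the answer following thelatemail's suggestion. @thelatemail, feel free to post your own answer if you want the credit.

@Stefan - all good - I already have plenty of imaginary internet points :-)

To save typing, you can write by=Year instead of by=list(Year=dt$Year). This avoids the double reference to dt.

@UweBlock If more by variables are needed, by=list(Year=dt$Year) is easier to extend. But I will remove the dt$.
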
Code for the melt()/dcast() approach:

molten <- melt(dt, id.vars = c("Year", "vessel"))

molten
#    Year vessel variable value
# 1: 2012      a      x.1     1
# 2: 2013      a      x.1     2
# 3: 2014      a      x.1     3
# 4: 2015      a      x.1     4
# 5: 2012      b      x.1     5
# ...
#19: 2014      a      x.3     4
#20: 2015      a      x.3     5
#21: 2012      b      x.3     6
#22: 2013      b      x.3     7
#23: 2014      b      x.3     8
#24: 2015      b      x.3     9
#    Year vessel variable value

dcast(molten, Year ~ variable, sum)
#   Year x.1 x.2 x.3
#1: 2012   6   6   8
#2: 2013   8   8  10
#3: 2014  10  10  12
#4: 2015  12  12  14 
dt[, .(vessels = uniqueN(vessel)), Year]
#   Year vessels
#1: 2012       2
#2: 2013       2
#3: 2014       2
#4: 2015       2
dcast(molten, Year ~ variable, sum)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
#   Year x.1 x.2 x.3 vessels
#1: 2012   6   6   8       2
#2: 2013   8   8  10       2
#3: 2014  10  10  12       2
#4: 2015  12  12  14       2