R data.table: sample by group with different sampling proportions
I would like to efficiently draw a random sample by group from a data.table, but it should be possible to sample a different proportion from each group.
If I wanted to sample the same fraction sampling_fraction from each group, I could take inspiration from a related question and answer and do something like this:
library(data.table)
DT = data.table(a = sample(1:2), b = sample(1:1000,20))
group_sampler <- function(data, group_col, sample_fraction){
# this function samples sample_fraction <0,1> from each group in the data.table
# inputs:
# data - data.table
# group_col - column(s) used to group by
# sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}
# what % of data should be sampled
sampling_fraction = 0.5
# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)
You can use .GRP, but make sure that each fraction is matched to the correct group. You may want to define the group column as a factor variable.
group_sampler <- function(data, group_col, sample_fractions) {
# this function samples a group-specific fraction <0,1> from each group in the data.table
# inputs:
# data - data.table
# group_col - column used to group by
# sample_fractions - vector of values between 0 and 1, one per group, in sorted group order (matched by position via .GRP)
stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
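Then pass the sampling fractions as a named vector (.GRP indexes the vector by position, in sorted group order, so the names serve only as labels):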
group_sampler(DT, 'a', sample_fractions= c(x = 0.1, y = 0.9))
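Because .GRP indexes the vector by position, the names in the call above are never consulted. A minimal sketch of matching by name instead, assuming a single grouping column; the function name group_sampler_named is hypothetical and is not part of the original answer:

```r
library(data.table)

# Hypothetical variant: look up each group's fraction by its name via .BY,
# so the order of the vector no longer matters.
group_sampler_named <- function(data, group_col, sample_fractions) {
  groups <- as.character(unique(data[[group_col]]))
  # every group must have a named entry in sample_fractions
  stopifnot(!is.null(names(sample_fractions)),
            all(groups %in% names(sample_fractions)))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(.BY[[1L]])]]))],
       by = group_col]
}

set.seed(1)  # only to make the sketch reproducible
DT <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))

# names refer to the group values; the order is deliberately reversed
res <- group_sampler_named(DT, "a", c("2" = 0.5, "1" = 0.1))
```

Here group 1 contributes ceiling(10 * 0.1) rows and group 2 contributes ceiling(10 * 0.5) rows, regardless of the order in which the fractions are supplied.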
Here is an option using a lookup table (so it does not depend on the order of the vector or of the groups):
library(data.table)
DT = data.table(group = sample(1:2), val = sample(1:1000, 20))
sample_props ...
#> 3:     2 680
#> 4:     2 613
#> 5:     2 170
#> 6:     2 175
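The lookup-table idea can be sketched as follows. This is an illustration under assumptions, not necessarily the answerer's exact code: the columns group and prop of sample_props, the fractions 0.1 and 0.5, and the result name sampled are all chosen for the example.

```r
library(data.table)

set.seed(42)  # only to make the sketch reproducible
DT <- data.table(group = rep(1:2, each = 10), val = sample(1:1000, 20))

# Assumed lookup table: one row per group with its sampling fraction
sample_props <- data.table(group = c(1L, 2L), prop = c(0.1, 0.5))

# Join the fraction onto every row, sample within each group using that
# group's fraction, then drop the helper column again
sampled <- DT[sample_props, on = "group"][
  , .SD[sample(.N, ceiling(.N * prop[1L]))], by = group][
  , prop := NULL]
```

Since the fractions are attached by a join on the group value, neither the row order of sample_props nor the order in which groups appear in DT affects which fraction a group receives.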
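Created on 2019-10-15 by the reprex package (v0.3.0).
Comments:
How do I define which one is group 1 and which one is group 2? In the example above, column 'a' has the values 1 and 2, so the groups are 1 and 2. I think that, to make sure the correct sampling fraction is assigned to each group, a named vector or something similar could be used in the function's input. I just don't know how to do it.
@chinsoon12 Could you give a short example? What exactly do you mean?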