R data.table: sampling by group with a different sampling fraction per group

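I want to efficiently draw a random sample by group from a data.table, but it should be possible to sample a different fraction from each group.

If I wanted to sample the same fraction sampling_fraction from every group, I could take inspiration from a related question and answer and do something like this: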

library(data.table)

DT = data.table(a = sample(1:2), b = sample(1:1000,20))

group_sampler <- function(data, group_col, sample_fraction){
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}

# what % of data should be sampled
sampling_fraction = 0.5

# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)
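
As a quick sanity check of the uniform-fraction approach, here is a minimal sketch that counts how many rows survive per group. The group sizes (10 and 6) and the seed are assumptions chosen for illustration; the sampling expression is the same one used inside group_sampler.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is reproducible

# illustrative data: group 1 has 10 rows, group 2 has 6 rows
DT <- data.table(a = rep(1:2, times = c(10, 6)), b = sample(1:1000, 16))

# same expression as in group_sampler, with a uniform fraction of 0.5
sampled <- DT[, .SD[sample(.N, ceiling(.N * 0.5))], by = a]

# rows kept per group: ceiling(10 * 0.5) = 5 and ceiling(6 * 0.5) = 3
counts <- sampled[, .N, keyby = a]
print(counts)
```

Because of ceiling(), every non-empty group contributes at least one row whenever the fraction is greater than 0.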
You can use .GRP, but make sure the fractions are matched to the correct groups. You may want to define group_col as a factor variable.

group_sampler <- function(data, group_col, sample_fractions) {
  # this function samples a group-specific fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column used to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, giving the
  #     share of each group to sample; matched to groups by position via .GRP,
  #     i.e. in the sorted group order produced by keyby
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
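Then pass sample_fractions as a named vector (the names serve only as labels here; the fractions are matched to groups by position):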

group_sampler(DT, 'a', sample_fractions= c(x = 0.1, y = 0.9))
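
If the fractions should be matched by name rather than by position, one possible variant looks each group's value up in the named vector through .BY. This is a sketch, not the answer's code: group_sampler_named is a hypothetical helper, it assumes a single grouping column, and the data and fractions are made up for illustration.

```r
library(data.table)

# hypothetical variant: fractions are looked up by group name, not by position
group_sampler_named <- function(data, group_col, sample_fractions) {
  groups <- as.character(unique(data[[group_col]]))
  # every group must have a fraction under its own name
  stopifnot(!is.null(names(sample_fractions)),
            all(groups %in% names(sample_fractions)))
  data[, {
    # .BY holds the current group's value; use it as the lookup key
    frac <- sample_fractions[[as.character(.BY[[1L]])]]
    .SD[sample(.N, ceiling(.N * frac))]
  }, keyby = group_col]
}

set.seed(42)  # arbitrary seed
DT <- data.table(a = rep(1:2, each = 8), b = sample(1:1000, 20 - 4))

# the order of the vector does not matter, only the names do
res <- group_sampler_named(DT, "a", c("2" = 0.75, "1" = 0.25))
counts <- res[, .N, keyby = a]
print(counts)  # group 1 keeps 2 rows, group 2 keeps 6 rows
```

With this variant a missing or misspelled group name fails the stopifnot() check instead of silently assigning a fraction to the wrong group.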

Here is an option that uses a lookup table (so it does not depend on the order of the vector or of the groups):

library(data.table)
DT = data.table(group = sample(1:2), val = sample(1:1000, 20))
sample_props <- ...
#> ...
#> 3:     2 680
#> 4:     2 613
#> 5:     2 170
#> 6:     2 175

Created on 2019-10-15 by the reprex package (v0.3.0).
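
Since most of the lookup-table code above did not survive, here is a minimal sketch of how such an approach could look. It is not a reconstruction of the original answer: the column name prop, the proportions, the group sizes and the seed are all assumptions made for illustration.

```r
library(data.table)

set.seed(2019)  # arbitrary seed
DT <- data.table(group = rep(1:2, each = 8), val = sample(1:1000, 16))

# assumed lookup table: one row per group with its sampling proportion;
# row order is deliberately different from the group order in DT
sample_props <- data.table(group = c(2L, 1L), prop = c(0.25, 0.5))

# attach the proportion to every row of DT by joining on the group column
with_prop <- sample_props[DT, on = "group"]

# within each group, keep ceiling(.N * prop) randomly chosen rows
sampled <- with_prop[, {
  n_keep <- ceiling(.N * prop[1L])
  .(val = val[sample(.N, n_keep)])
}, keyby = group]

counts <- sampled[, .N, keyby = group]
print(counts)  # group 1 keeps 4 rows, group 2 keeps 2 rows
```

Because the proportions are attached through a join on the group value, neither the row order of the lookup table nor the order in which groups appear in the data affects which fraction a group receives.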

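How do you define which one is group 1 and which one is group 2?
In the example above, column 'a' has the values 1 and 2, hence groups 1 and 2. I think that, to make sure each group is assigned the correct sampling fraction, a named vector or something similar could be used as the function input. I just don't know how to do it.
@chinsoon12 Could you give a short example? What exactly do you mean?
group_sampler(DT, 'a', sample_fractions= c(x = 0.1, y = 0.9))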
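
On the question of which group counts as group 1: the small sketch below (toy data, assumed for illustration) shows how .GRP is numbered. With by, groups are numbered in order of first appearance; with keyby, they are numbered in sorted order, which is why the positional matching of fractions relies on keyby.

```r
library(data.table)

# toy data in which group 2 appears before group 1
DT <- data.table(a = c(2L, 2L, 1L, 1L), b = 1:4)

# by: groups numbered in order of first appearance (a = 2 gets .GRP 1)
by_order <- DT[, .(grp = .GRP), by = a]

# keyby: groups numbered in sorted order (a = 1 gets .GRP 1)
keyby_order <- DT[, .(grp = .GRP), keyby = a]

print(by_order)
print(keyby_order)
```

So a positional vector such as c(0.1, 0.9) applies 0.1 to the smallest group value and 0.9 to the next one only when keyby is used.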