Reproducibly splitting data into training and testing in R [r, cross-validation, sampling, reproducible-research, robustness]


A common way to sample or split data in R is to use sample(), e.g. on row numbers. For example:

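library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6 - 1))  # some made-up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
# row numbers for the test set: N/2 draws without replacement from 1..(N-1)
test <- sample(N - 1, N / 2, replace = FALSE)
test1 <- sample1[test, .(id)]  # test set: the rows of sample1 at those positions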

A second sample, sample2, is all but identical to sample1:

nrow(merge(sample1, sample2))

[1] 9999
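The definition of sample2 is not shown above. A minimal sketch, assuming sample2 is simply sample1 with one randomly chosen row dropped (an assumption, chosen because it matches the 9999 shared ids and the 1..(N-1) index range used for test):

```r
library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6 - 1))  # made-up ids, as in the question
N <- 1e4
sample1 <- data.table(id = sort(sample(population, N)))

# Assumption: sample2 is sample1 minus one randomly chosen observation
sample2 <- sample1[-sample(N, 1)]

# Number of ids the two samples have in common (merge joins on the shared id column)
shared <- nrow(merge(sample1, sample2))
```

Because every id after the dropped row moves up by one position, any split based on row numbers picks different ids from that point on.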

Yet the same row split yields very different test sets, even though we have set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

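One could sample specific ids instead, but that would not be robust if observations are dropped or added.

Is there a way to make the split more robust to changes in the data? That is, unchanged observations keep their test assignment, dropped observations are simply not assigned, and new observations get assigned.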

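Use a hash function and sample on the mod of one of its digits (the code below uses the first hex digit of the MD5 hash):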

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

Because each id is assigned by its own hash rather than by its row position, the split no longer depends on which other rows are present:

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

The sample size is not exactly 5000 because the assignment is probabilistic, but thanks to the law of large numbers this should not be a problem in large samples.
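With a single hex digit, md5_bit_mod can only produce test shares that are multiples of 1/16 (for example, m = 4 gives about 25%). For other proportions, one option is to use more digits of the hash. A minimal sketch, assuming the openssl package is installed; hash_unit and the 0.2 threshold are illustrative names and values, not part of the original answer:

```r
# Hypothetical helper (not part of the original answer): map each id to a
# number in [0, 1) using the first 7 hex digits of its MD5 hash.
hash_unit <- function(x) {
  hex <- substr(openssl::md5(x), 1, 7)  # 7 hex digits fit in a 32-bit integer
  strtoi(hex, base = 16L) / 16^7
}

ids <- as.character(100001:110000)  # made-up ids, similar to the question
in_test <- hash_unit(ids) < 0.2     # TRUE for roughly 20% of the ids
mean(in_test)                       # share of ids assigned to the test set
```

As with md5_bit_mod, the assignment of an id depends only on the id itself, so adding or dropping other rows does not change it; the realized share is only approximately equal to the threshold.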