Random sample of an R character vector with no element prefixing another


Consider a character vector, `pool`, whose elements are (zero-padded) binary numbers with up to `max_len` digits.

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x) 
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))

pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110" 
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"
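
This can be improved slightly by first removing from `pool` those elements whose inclusion would leave too few elements in `pool` to reach a total sample of size `n`. For example, when `max_len = 4` and `n > 9`, we can immediately remove `0` and `1` from `pool`, because if either one is included the largest possible sample has 9 elements (either `0` plus the eight 4-character elements starting with `1`, or `1` plus the eight 4-character elements starting with `0`).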

Based on this logic, we can drop elements from `pool` before taking the initial sample, like so:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]
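
The counting argument behind this pruning can be written out explicitly. The sketch below is only an illustration: `max_sample_size` and `prune_pool` are hypothetical helper names, not part of the question. It assumes the pool contains every binary string of length 1 to `max_len`, so that an element of length `k` can share a sample with at most the `2^max_len - 2^(max_len - k)` full-length strings outside its own subtree.

```r
# Largest prefix-free sample that can contain an element of length k:
# the element itself plus every max_len-digit string outside its subtree.
max_sample_size <- function(k, max_len) 2^max_len - 2^(max_len - k) + 1

# Keep only the elements that can still belong to a sample of size n.
prune_pool <- function(pool, n, max_len) {
  pool[n <= max_sample_size(nchar(pool), max_len)]
}

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x)
  do.call(paste0, expand.grid(rep(list(c("0", "1")), x)))))

max_sample_size(1:4, max_len)                # 9 13 15 16
table(nchar(prune_pool(pool, 10, max_len)))  # only lengths 2, 3 and 4 remain
```

Unlike the `tail(which(...))` form, this comparison leaves the pool untouched when no length has to be dropped (for example `n = 9`); in that case `which()` returns an empty vector, so the one-liner appears to subset the pool down to nothing.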
Can anyone think of a better approach? I feel like I'm overlooking something much simpler.


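Edit

To clarify my intent, I picture the pool as a set of branches, where the junctions and tips are the nodes (the elements of `pool`). Suppose the yellow node in the figure below (i.e. `010`) has been drawn. Now the entire red "branch", which consists of nodes `0`, `01`, and `010`, is removed from the pool. This is what I mean by forbidding the sampling of nodes that "prefix" nodes already in the sample (as well as nodes that are already prefixed by nodes in the sample).

If the sampled node sits partway along a branch, such as `01` in the figure below, then all of the red nodes (`0`, `01`, `010`, and `011`) are disallowed, because `0` prefixes `01`, and `01` prefixes both `010` and `011`.

I do not mean to sample either `1` or `0` at each junction (i.e. walking along the branches and flipping a coin at each fork). It is fine to have both in the sample, as long as: (1) none of the node's parents (or grandparents, etc.) or children (grandchildren, etc.) have already been sampled; and (2) after sampling the node, enough nodes remain to reach the desired sample size `n`.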
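
Condition (1) amounts to a pairwise prefix test. As a minimal sketch (the helper names `conflicts` and `is_prefix_free` are made up for this illustration), base R's `startsWith()` is enough to check whether a candidate sample is valid:

```r
# TRUE where one string is a prefix of the other (vectorised over pairs).
conflicts <- function(a, b) startsWith(a, b) | startsWith(b, a)

# TRUE if no element of x is a prefix of another element of x.
is_prefix_free <- function(x) {
  if (length(x) < 2) return(TRUE)
  pairs <- combn(length(x), 2)  # every unordered pair of positions
  !any(conflicts(x[pairs[1, ]], x[pairs[2, ]]))
}

is_prefix_free(c("010", "1", "00"))  # TRUE
is_prefix_free(c("01", "011"))       # FALSE: "01" prefixes "011"
```

Note that a duplicated element counts as a conflict here, since every string is a prefix of itself.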


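In the second figure above, if `010` were the first pick, all of the black nodes would still (currently) be valid, assuming `n` ...


One approach is simply to generate all possible tuples of the appropriate size iteratively: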

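  • Build all tuples of size 1 (every element of `pool`)
  • Take the cross product with the elements of `pool`
  • Remove any tuple that uses the same element of `pool` more than once
  • Remove any exact duplicate of another tuple
  • Remove any tuple that contains a pair that cannot be used together
  • Rinse and repeat until the tuples reach the desired size

For the sizes given (`pool` of length 30, `max_len` of 4), this is runnable; the answer implements it as `get.template(pool, n)`.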
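
The steps above can be sketched roughly as follows. This is an illustration written for this walkthrough, not the answer's own `get.template`: `enumerate_tuples` is a hypothetical name, and instead of building a full cross product and filtering it afterwards, it only ever extends a tuple with elements that come later in the pool, which avoids repeats and duplicate tuples from the start.

```r
# Enumerate every prefix-free tuple of size n drawn from pool.
enumerate_tuples <- function(pool, n) {
  pool <- sort(pool)
  # ok[i, j] is TRUE when pool[i] and pool[j] may appear together.
  ok <- outer(pool, pool, function(a, b) !(startsWith(a, b) | startsWith(b, a)))
  tuples <- as.list(seq_along(pool))  # all tuples of size 1, stored as indices
  for (size in seq_len(n - 1)) {
    tuples <- unlist(lapply(tuples, function(tp) {
      # Only extend with later elements, so each set is built exactly once.
      cand <- which(seq_along(pool) > tp[length(tp)])
      cand <- cand[vapply(cand, function(j) all(ok[tp, j]), logical(1))]
      lapply(cand, function(j) c(tp, j))
    }), recursive = FALSE)
  }
  lapply(tuples, function(tp) pool[tp])
}

pool2 <- c("0", "1", "00", "01", "10", "11")
tuples <- enumerate_tuples(pool2, 3)
length(tuples)                        # 6 valid triples
tuples[[sample(length(tuples), 1)]]   # one uniformly chosen sample
```

Because every valid tuple appears exactly once, picking one at random gives a uniform draw over valid samples, but the number of tuples grows quickly with `max_len` and `n`.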


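If you do not want to generate the set of all possible tuples and then sample from it at random (which, as you note, may be infeasible for larger inputs), another option is to draw a single sample using integer programming. Basically, you assign each element of `pool` a random value and then select the feasible tuple with the largest sum of values. This should give every tuple the same probability of being selected, since they all have the same size and their values are all chosen at random. The constraints of the model ensure that no disallowed pair is selected and that the correct number of elements is selected.

Here is a solution using the `lpSolve` package: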

    library(lpSolve)
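    # `n` is the number of elements to draw; `pool` is assumed to hold every
    # binary string of length 1..max(nchar(pool))
    sample.lp <- function(pool, n) {
      pool <- sort(pool)
      pml <- max(nchar(pool))
      # runs[k]: number of descendants of a length-k element; in the sorted
      # pool they directly follow that element
      runs <- c(rev(cumsum(2^(seq(pml-1)))), 0)
      # One constraint row per (element, descendant) pair
      banned.from <- rep(seq(pool), runs[nchar(pool)])
      banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len))
      banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool))
      banned.constr[cbind(seq(banned.from), banned.from)] <- 1
      banned.constr[cbind(seq(banned.to), banned.to)] <- 1
      # Random objective; each pair row allows at most one of its two
      # elements, and the last row fixes the sample size at n
      mod <- lp(direction="max",
                objective.in=runif(length(pool)),
                const.mat=rbind(banned.constr, rep(1, length(pool))),
                const.dir=c(rep("<=", length(banned.from)), "=="),
                const.rhs=c(rep(1, length(banned.from)), n),
                all.bin=TRUE)
      pool[which(mod$solution == 1)]
    }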
    set.seed(144)
    pool <- unlist(lapply(seq_len(4), function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
    sample.lp(pool, 4)
    # [1] "0011" "010"  "1000" "1100"
    sample.lp(pool, 8)
    # [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110"
    

You can sort the pool to help decide which elements to disqualify. For example, look at the sorted pool with up to three digits:

     [1] "0"   "00"  "000" "001" "01"  "010" "011" "1"   "10"  "100" "101" "11" 
    [13] "110" "111"
    
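Here I can see that I can disqualify every item that follows my picked item and has more characters than it, up to the first item that has the same number of characters or fewer. For example, if I pick "01", I can immediately see that the next two items ("010", "011") need to be removed, but not the one after them, because "1" has fewer characters. Removing "0" afterwards is easy. Here is an implementation: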

    library(fastmatch)  # could use `match`, but we repeatedly search against same hash
    
    # `pool` must be sorted!
    
    sample01 <- function(pool, n) {
      picked <- logical(length(pool))
      chrs <- nchar(pool)
      pick.list <- character(n)
      pool.seq <- seq_along(pool)
    
      for(i in seq(n)) {
        # Make sure pool not exhausted
    
        left <- which(!picked)
        left.len <- length(left)
        if(!length(left)) break
    
        # Sample from pool
    
        seq.left <- seq.int(left)
        pool.left <- pool[left]
        chrs.left <- chrs[left]
        pick <- sample(length(pool.left), 1L)
    
        # Find all the elements with more characters that are disqualified
        # and store their indices in `valid` (bad name...)
    
        valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
        first.invalid <- which(!valid.tmp & seq.left > pick)
        valid <- if(length(first.invalid)) {
          pick:(first.invalid[[1L]] - 1L)
        } else pick:left.len
    
        # Translate back to original pool indices since we're working on a 
        # subset in `pool.left`
    
        pool.seq.left <- pool.seq[left]
        pool.idx <- pool.seq.left[valid]
        val <- pool[[pool.idx[[1L]]]]
    
        # Record the picked value, and all the disqualifications
    
        pick.list[[i]] <- val
        picked[pool.idx] <- TRUE
    
        # Disqualify shorter matches
    
        to.rem <- vapply(
          seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
        )
        to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
        picked[to.rem.idx] <- TRUE  
      }
      pick.list  
    }
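
The contiguity that `sample01` relies on can be checked in isolation. The snippet below is a small sketch with a made-up helper name (`descendants_run`); it assumes the pool is sorted in C-locale order, which `sort(..., method = "radix")` provides, and simply walks forward from the pick while the following items are longer.

```r
# Sorted pool of all binary strings with 1 to 3 digits.
pool3 <- sort(unlist(lapply(1:3, function(k)
  do.call(paste0, expand.grid(rep(list(c("0", "1")), k))))), method = "radix")

# Items that directly follow `pick` and are longer than it: its descendants.
descendants_run <- function(pool, pick) {
  i <- match(pick, pool)
  longer <- nchar(pool) > nchar(pick)
  j <- i
  while (j < length(pool) && longer[j + 1]) j <- j + 1
  pool[seq_len(j - i) + i]
}

descendants_run(pool3, "01")  # "010" "011"
descendants_run(pool3, "0")   # "00" "000" "001" "01" "010" "011"
```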
    
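Note that in the last of the `sample01(pool3, 8)` example runs we get all the 3-digit binary combinations (2^3), because we happened to keep sampling from the 3-digit combinations. Also, with a pool of only size 3, many draws make a full sample of 8 impossible; you could address this with your suggestion of eliminating the combinations that prevent a full draw from the pool.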

This is very fast. Take the `max_len == 9` example, which takes 2 seconds with the alternative solution:

    pool9 <- make_pool(9)
    microbenchmark(sample01(pool9, 4))
    # Unit: microseconds
    #                expr     min      lq  median      uq     max neval
    #  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100    
    
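
Introduction

This is a numeric variant of the string algorithm implemented in another answer. It is faster and does not require creating or sorting the pool.

Algorithm outline

We can use integers to represent the binary strings, which greatly simplifies both generating the pool and eliminating values sequentially. For example, with `max_len == 3`, we can take the number `1--` (where `-` denotes padding) to mean `4` in decimal. Furthermore, we can determine that if we pick this number, the numbers that need to be eliminated are those between `4` and `4 + 2^x - 1`. Here `x` is the number of padding elements (2 in this case), so the numbers to eliminate lie between `4` and `4 + 2^2 - 1` (that is, between `4` and `7`, written `100`, `101`, `110`, and `111`).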
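
The range arithmetic in this outline is easy to try out. The helper below is a sketch under the outline's assumptions (padding is treated as trailing zeros; `elim_range` is a hypothetical name, not a function from the answer):

```r
# Integer range covered by a padded prefix, e.g. "1--" with max_len = 3.
elim_range <- function(prefix, max_len) {
  pad <- max_len - nchar(prefix)             # number of padding positions, x
  lo  <- strtoi(prefix, base = 2) * 2^pad    # prefix followed by zeros
  c(lo, lo + 2^pad - 1)                      # lo .. lo + 2^x - 1
}

elim_range("1", 3)    # 4 7
elim_range("01", 3)   # 2 3
elim_range("111", 3)  # 7 7
```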

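Matching your problem exactly requires a small wrinkle, because for some parts of the algorithm you treat numbers that would be identical in binary as distinct. For example, `100`, `10-`, and `1--` all stand for the same number, 4, yet they are different elements of the pool.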
    
    make_pool <- function(size)
      sort(
        unlist(
          lapply(
            seq_len(size), 
            function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x))) 
      ) ) )
    
    pool3 <- make_pool(3)
    set.seed(1)
    sample01(pool3, 8)
    # [1] "001" "1"   "010" "011" "000" ""    ""    ""   
    sample01(pool3, 8)
    # [1] "110" "111" "011" "10"  "00"  ""    ""    ""   
    sample01(pool3, 8)
    # [1] "000" "01"  "11"  "10"  "001" ""    ""    ""   
    sample01(pool3, 8)
    # [1] "011" "101" "111" "001" "110" "100" "000" "010"    
    
    pool16 <- make_pool(16)  # 131K entries
    system.time(sample01(pool16, 100))
    #  user  system elapsed 
    # 3.407   0.146   3.552 
    
    0 - 000: 0--, 00-
    1 - 001:
    2 - 010: 01-
    3 - 011:
    4 - 100: 1--, 10-
    5 - 101:
    6 - 110: 11-
    7 - 111:
    
    jbaum | int | bin | bin.enc | int.enc    
      0-- |   0 | 000 |   00000 |       0
      00- |   0 | 000 |   00001 |       1      
      000 |   0 | 000 |   00010 |       2      
      001 |   1 | 001 |   00100 |       3      
      01- |   2 | 010 |   01000 |       4  
      010 |   2 | 010 |   01001 |       5  
      011 |   3 | 011 |   01100 |       6  
      1-- |   4 | 100 |   10000 |       7  
      10- |   4 | 100 |   10001 |       8  
      100 |   4 | 100 |   10010 |       9  
      101 |   5 | 101 |   10100 |      10  
      11- |   6 | 110 |   11000 |      11   
      110 |   6 | 110 |   11001 |      12   
      111 |   7 | 111 |   11100 |      13
    
    # each column represents a draw from a `max_len==4` pool
    
    set.seed(6); replicate(10, sample0110b(4, 8))
         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10] 
    [1,] "1000" "1"    "0011" "0010" "100"  "0011" "0"    "011"  "0100" "1011"
    [2,] "111"  "0000" "1101" "0000" "0110" "0100" "1000" "00"   "0101" "1001"
    [3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10"   "1100"
    [4,] "0100" "0010" "0000" "0101" "1101" "101"  "1011" "1101" "0110" "1101"
    [5,] "101"  "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111"
    [6,] "110"  "0111" "1011" "111"  "1011" "110"  "1111" "0100" "0011" "000" 
    [7,] "0101" "0101" "111"  "011"  "1010" "1000" "1100" "101"  "0001" "0101"
    [8,] "011"  "0001" "01"   "1010" "0011" "1110" "1110" "1001" "110"  "1000"
    
       size    n  jbaum josilber  frank tensibai brodie.b brodie brodie.C brodie.str
    1     4   10     11        1      3        1        1      1        1          0
    2     4   50      -        -      -        1        -      -        -          1
    3     4  100      -        -      -        1        -      -        -          0
    4     4  256      -        -      -        1        -      -        -          1
    5     4 1000      -        -      -        1        -      -        -          1
    6     8   10      1      290      6        3        2      2        1          1
    7     8   50    388        -      8        8        3      4        3          4
    8     8  100  2,506        -     13       18        6      7        5          5
    9     8  256      -        -     22       27       13     14       12          6
    10    8 1000      -        -      -       27        -      -        -          7
    11   16   10      -        -    615      688       31     61       19        424
    12   16   50      -        -  2,123    2,497       28    276       19      1,764
    13   16  100      -        -  4,202    4,807       30    451       23      3,166
    14   16  256      -        - 11,822   11,942       40  1,077       43      8,717
    15   16 1000      -        - 38,132   44,591       83  3,345      130     27,768
    
    system.time(sample0110b(18, 100000))
       user  system elapsed 
      8.441   0.079   8.527 
    
    # some key objects
    
    n_pool      = sum(2^(1:max_len))      # total number of indices
    cuts        = cumsum(2^(1:max_len-1)) # new group starts
    inds_by_g   = mapply(seq,cuts,cuts*2) # indices grouped by length
    
    # the mapping to strings (one among many possibilities)
    
    library(data.table)
    get_01str <- function(id,max_len){
        cuts = cumsum(2^(1:max_len-1))
        g    = findInterval(id,cuts)
        gid  = id-cuts[g]+1
    
        data.table(g,gid)[,s:=
          do.call(paste,c(list(sep=""),lapply(
            seq(g[1]), 
            function(x) (gid-1) %/% 2^(x-1) %% 2
          )))
        ,by=g]$s      
    } 
    
     # the mapping from one index to indices of nixed strings
    
    get_nixstrs <- function(g,gid,max_len){
    
        cuts         = cumsum(2^(1:max_len-1))
        gids_child   = {
          x = gid%%2^sequence(g-1)
          ifelse(x,x,2^sequence(g-1))
        }
        ids_child    = gids_child+cuts[sequence(g-1)]-1
    
        ids_parent   = if (g==max_len) gid+cuts[g]-1 else {
    
          gids_par       = vector(mode="list",max_len)
          gids_par[[g]]  = gid
          for (gg in seq(g,max_len-1)) 
            gids_par[[gg+1]] = c(gids_par[[gg]],gids_par[[gg]]+2^gg)
    
          unlist(mapply(`+`,gids_par,cuts-1))
        }
    
        c(ids_child,ids_parent)
    }
    
    drawem <- function(n,max_len){
        cuts        = cumsum(2^(1:max_len-1))
        inds_by_g   = mapply(seq,cuts,cuts*2)
    
        oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
        okinds = unlist(inds_by_g[oklens])
    
        mysamp = rep(0,n)
        for (i in 1:n){
    
            id        = if (length(okinds)==1) okinds else sample(okinds,1)
            g         = findInterval(id,cuts)
            gid       = id-cuts[g]+1
            nixed     = get_nixstrs(g,gid,max_len)
    
            # print(id); print(okinds); print(nixed)
    
            mysamp[i] = id
            okinds    = setdiff(okinds,nixed)
            if (!length(okinds)) break
        }
    
        res <- rep("",n)
        res[seq.int(i)] <- get_01str(mysamp[seq.int(i)],max_len)
        res
    }
    
    # how the indices line up
    
    n_pool = sum(2^(1:max_len)) 
    pdt <- data.table(id=1:n_pool)
    pdt[,g:=findInterval(id,cuts)]
    pdt[,gid:=1:.N,by=g]
    pdt[,s:=get_01str(id,max_len)]
    
    # example run
    
    set.seed(4); drawem(5,5)
    # [1] "01100" "1"     "0001"  "0101"  "00101"
    
    set.seed(4); drawem(8,4)
    # [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""  
    
    require(rbenchmark)
    max_len = 8
    n = 8
    
    benchmark(
          jos_lp     = {
            pool <- unlist(lapply(seq_len(max_len),
              function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
            sample.lp(pool, n)},
          bro_string = {pool <- make_pool(max_len);sample01(pool,n)},
          fra_num    = drawem(n,max_len),
          replications=5)[1:5]
    #         test replications elapsed relative user.self
    # 2 bro_string            5    0.05      2.5      0.05
    # 3    fra_num            5    0.02      1.0      0.02
    # 1     jos_lp            5    1.56     78.0      1.55
    
    n = 12
    max_len = 12
    benchmark(
      bro_string={pool <- make_pool(max_len);sample01(pool,n)},
      fra_num=drawem(n,max_len),
      replications=5)[1:5]
    #         test replications elapsed relative user.self
    # 1 bro_string            5    0.54     6.75      0.51
    # 2    fra_num            5    0.08     1.00      0.08
    
    jos_enum = {pool <- unlist(lapply(seq_len(max_len), 
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
      get.template(pool, n)}
    bro_num  = sample011(max_len,n)    
    
      i  depth    n  time (ms)
      1      4   10  0.182511806488
      2      4   50  --   
      3      4  100  --   
      4      4  150  --   
      5      8   10  0.397620201111
      6      8   50  1.66054964066
      7      8  100  2.90236949921
      8      8  150  3.48146915436
      9     15   10  0.804011821747
     10     15   50  3.7428188324
     11     15  100  7.34910964966
     12     15  150  10.8230614662
     13     16   10  0.804491043091
     14     16   50  3.66818904877
     15     16  100  7.09567070007
     16     16  150  10.404779911
     17     17   10  0.865840911865
     18     17   50  3.9999294281
     19     17  100  7.70257949829
     20     17  150  11.3758206367
     21     18   10  0.915451049805
     22     18   50  4.22935962677
     23     18  100  8.22361946106
     24     18  150  12.2081303596
    
    ['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
    ['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
    ['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
    ['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
    ['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
    ['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
    ['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
    ['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
    ['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
    ['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']
    
    library(microbenchmark)
    library(lineprof)
    
    max_len <- 16
    pool <- unlist(lapply(seq_len(max_len), function(x) 
      do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
    n<-100
    
    library(stringr)
    tree_sample <- function(samples, pool, max_len = max(str_length(pool))) {
      # Picks are strings, so the result vector is character
      results <- vector("character",samples)
      # Will be used on a regular basis, compute it in advance
      PoolLen <- str_length(pool)
      # Make a mask vector based on the length of each entry of the pool
      # (as many leading 1s as the entry has digits, then 0s up to max_len)
      masks <- strtoi(str_pad(str_pad("1",PoolLen,"right","1"),max_len,"right","0"),base=2)
    
      # Make an integer vector from the "0" right-padded original: for max_len=4 and pool entry "1" we get "1000" => 8
      # This allows finding this entry as the parent of 10 and 11, which become "1000" and "1100", i.e. integers 8 and 12
      # Once bitwise "anded" with the respective mask "1000" the first bit is strictly the same, so it's a parent.
      integerPool <- strtoi(str_pad(pool,max_len,"right","0"),base=2)
    
      # Create a vector to filter the available values to sample
      ok <- rep(TRUE,length(pool))
    
      # Precompute the result of the bitwise and between our integer pool and the masks
      MaskedPool <- bitwAnd(integerPool,masks)
    
      while(samples) {
        samp <- sample(pool[ok],1) # Get a sample
        results[samples] <- samp # Store it as result
        ok[pool == samp] <- FALSE # Remove it from available entries
    
        vsamp <- strtoi(str_pad(samp,max_len,"right","0"),base=2) # Get the integer value of the "0" right-padded sample
        mlen <- str_length(samp) # Get sample length
    
        # Create a single mask to remove the children of the sample
        mask <- strtoi(paste0(rep(1:0,c(mlen,max_len-mlen)),collapse=""),base=2)
    
        # Get the result of bitwise and between the integerPool and the sample mask
        FilterVec <- bitwAnd(integerPool,mask)
    
        # Get the bitwise and result of the sample and its mask
        Childm <- bitwAnd(vsamp,mask)
    
        ok[FilterVec == Childm] <- FALSE  # Remove the children of the sample from the available entries
        ok[MaskedPool == bitwAnd(vsamp,masks)] <- FALSE # Compare the sample with all the masks to remove matching parents
    
        samples <- samples -1
      }
      # Return the picks (stored from last position to first)
      results
    }
    microbenchmark(tree_sample(n,pool),times=10L)
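
The masking idea in `tree_sample` can be seen on a single pair of strings. The following is a minimal base-R sketch with hypothetical helper names (`pad_int`, `prefix_mask`, `is_prefix_bitwise`): both strings are right-padded with zeros to `max_len` bits, and `a` prefixes `b` when the leading `nchar(a)` bits of `b` equal the padded `a`.

```r
# Integer value of s after right-padding with "0" to max_len bits.
pad_int <- function(s, max_len)
  strtoi(paste0(s, strrep("0", max_len - nchar(s))), base = 2)

# Mask with `len` leading 1 bits followed by 0 bits.
prefix_mask <- function(len, max_len)
  strtoi(paste0(strrep("1", len), strrep("0", max_len - len)), base = 2)

# TRUE if a is a prefix of b, decided with a bitwise and.
is_prefix_bitwise <- function(a, b, max_len) {
  nchar(a) <= nchar(b) &&
    bitwAnd(pad_int(b, max_len), prefix_mask(nchar(a), max_len)) == pad_int(a, max_len)
}

is_prefix_bitwise("1", "10", 4)    # TRUE:  1000 & 1000 == 1000
is_prefix_bitwise("01", "011", 4)  # TRUE:  0110 & 1100 == 0100
is_prefix_bitwise("1", "01", 4)    # FALSE: 0100 & 1000 != 1000
```

The explicit length check matters because padding alone cannot tell `0` from `00`: both become the integer 0.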
    
    Let x be the array index.
    x = 0 is the root of the entire tree
    left_child(x) = 2x + 1
    right_child(x) = 2x + 2
    parent(x) = floor((x-1)/2)
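
These index relations can be tried out directly. The sketch below assumes, purely for illustration, that a left edge stands for the digit 0 and a right edge for the digit 1, so that every non-root index corresponds to one binary string; `node_label` and `ancestors` are hypothetical helper names.

```r
left_child  <- function(x) 2 * x + 1
right_child <- function(x) 2 * x + 2
parent      <- function(x) (x - 1) %/% 2

# Binary string for index x > 0: read the edges on the path from the root.
node_label <- function(x) {
  bits <- character(0)
  while (x > 0) {
    bits <- c(if (x %% 2 == 1) "0" else "1", bits)  # odd index = left child
    x <- parent(x)
  }
  paste(bits, collapse = "")
}

# Indices of the proper ancestors of x, excluding the root.
ancestors <- function(x) {
  out <- integer(0)
  while (x > 0) {
    x <- parent(x)
    if (x > 0) out <- c(out, x)
  }
  out
}

vapply(1:6, node_label, character(1))  # "0" "1" "00" "01" "10" "11"
node_label(ancestors(4))               # "0", the only proper prefix of "01"
```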
    
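    #include <stdint.h>
    #include <algorithm>
    #include <cassert>   // assert
    #include <cmath>
    #include <list>
    #include <deque>
    #include <ctime>
    #include <cstdlib>
    #include <iostream>
    #include <random>    // std::mt19937_64, std::uniform_int_distribution
    #include <string>    // std::string, std::stoll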
    
    /*
     * A range of values of the form (a, b), where a <= b, and is inclusive.
     * Ex (1,1) is the range from 1 to 1 (ie: just 1)
     */
    class Range
    {
    private:
        friend bool operator< (const Range& lhs, const Range& rhs);
        friend std::ostream& operator<<(std::ostream& os, const Range& obj);
    
        int64_t m_start;
        int64_t m_end;
    
    public:
        Range(int64_t start, int64_t end) : m_start(start), m_end(end) {}
        int64_t getStart() const { return m_start; }
        int64_t getEnd() const { return m_end; }
        int64_t size() const { return m_end - m_start + 1; }
        bool canMerge(const Range& other) const {
            return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1));
        }
        int64_t merge(const Range& other) {
            int64_t change = 0;
            if (m_start > other.m_start) {
                change += m_start - other.m_start;
                m_start = other.m_start;
            }
            if (other.m_end > m_end) {
                change += other.m_end - m_end;
                m_end = other.m_end;
            }
            return change;
        }
    };
    
    inline bool operator< (const Range& lhs, const Range& rhs){return lhs.m_start < rhs.m_start;}
    std::ostream& operator<<(std::ostream& os, const Range& obj) {
        os << '(' << obj.m_start << ',' << obj.m_end << ')';
        return os;
    }
    
    /*
     * Struct to allow returning multiple values
     */
    struct NodeInfo {
        int64_t subTreeSize;
        int64_t depth;
        std::list<int64_t> ancestors;
        std::string representation;
    };
    
    /*
     * Collection of functions representing a complete binary tree
     * as an array created using pre-order depth-first search,
     * with 0 as the root.
     * Depth of the root is defined as 0.
     */
    class Tree
    {
    private:
        int64_t m_depth;
    public:
        Tree(int64_t depth) : m_depth(depth) {}
        int64_t size() const {
            return (int64_t(1) << (m_depth+1))-1;
        }
        int64_t getDepthOf(int64_t node) const{
            if (node == 0) { return 0; }
            int64_t searchDepth = m_depth;
            int64_t currentDepth = 1;
            while (true) {
                int64_t rightChild = int64_t(1) << searchDepth;
                if (node == 1 || node == rightChild) {
                    break;
                } else if (node > rightChild) {
                    node -= rightChild;
                } else {
                    node -= 1;
                }
                currentDepth += 1;
                searchDepth -= 1;
            }
            return currentDepth;
        }
        int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const {
            if (node == 0) {
                return size();
            }
            if (nodeDepth == -1) {
                nodeDepth = getDepthOf(node);
            }
            return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1;
        }
        int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const {
            if (nodeDepth == -1) {
                nodeDepth = getDepthOf(node);
            }
            if (nodeDepth == m_depth) { return -1; }
            return node + 1;
        }
        int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const {
            if (nodeDepth == -1) {
                nodeDepth = getDepthOf(node);
            }
            if (nodeDepth == m_depth) { return -1; }
            return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2);
        }
        NodeInfo getNodeInfo(int64_t node) const {
            NodeInfo info;
            int64_t depth = 0;
            int64_t currentNode = 0;
            while (currentNode != node) {
                if (currentNode != 0) {
                    info.ancestors.push_back(currentNode);
                }
                int64_t rightChild = getRightChildOf(currentNode, depth);
                if (rightChild == -1) {
                    break;
                } else if (node >= rightChild) {
                    info.representation += '1';
                    currentNode = rightChild;
                } else {
                    info.representation += '0';
                    currentNode = getLeftChildOf(currentNode, depth);
                }
                depth++;
            }
            info.depth = depth;
            info.subTreeSize = getSubtreeSizeOf(node, depth);
            return info;
        }
    };
    
    // random selection amongst remaining allowed nodes
    int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize, std::mt19937_64& randomGenerator)
    {
        // 64-bit distribution so that large pool sizes are not truncated to int
        std::uniform_int_distribution<int64_t> randomDistribution(1, poolSize);
        int64_t selection = randomDistribution(randomGenerator);
        for (auto const& range : eliminationList) {
            if (selection >= range.getStart()) { selection += range.size(); }
            else { break; }
        }
        return selection;
    }
    
    // determine how many nodes have been eliminated
    int64_t countEliminated(const std::deque<Range>& eliminationList)
    {
        int64_t count = 0;
        for (auto const& range : eliminationList) {
            count += range.size();
        }
        return count;
    }
    
    // merge all the elimination ranges into listA, and return the number of new eliminations
    int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB) {
        if(listB.empty()) { return 0; }
        if(listA.empty()) {
            listA.swap(listB);
            return countEliminated(listA);
        }
    
        int64_t newEliminations = 0;
        int64_t x = 0;
        auto listA_iter = listA.begin();
        auto listB_iter = listB.begin();
        while (listB_iter != listB.end()) {
            if (listA_iter == listA.end()) {
                listA_iter = listA.insert(listA_iter, *listB_iter);
                x = listB_iter->size();
                assert(x >= 0);
                newEliminations += x;
                ++listB_iter;
            } else if (listA_iter->canMerge(*listB_iter)) {
                x = listA_iter->merge(*listB_iter);
                assert(x >= 0);
                newEliminations += x;
                ++listB_iter;
            } else if (*listB_iter < *listA_iter) {
                listA_iter = listA.insert(listA_iter, *listB_iter) + 1;
                x = listB_iter->size();
                assert(x >= 0);
                newEliminations += x;
                ++listB_iter;
            } else if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
                listA_iter->merge(*(listA_iter+1));
                listA_iter = listA.erase(listA_iter+1);
            } else {
                ++listA_iter;
            }
        }
        while (listA_iter != listA.end()) {
            if ((listA_iter+1) != listA.end() && listA_iter->canMerge(*(listA_iter+1))) {
                listA_iter->merge(*(listA_iter+1));
                listA_iter = listA.erase(listA_iter+1);
            } else {
                ++listA_iter;
            }
        }
        return newEliminations;
    }
    
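    int main (int argc, char** argv)
    {
        // Expect two arguments: the tree depth (max_len) and the sample count
        if (argc < 3) {
            std::cerr << "usage: <program> max_len num_samples" << std::endl;
            return 1;
        }

        std::random_device rd;
        std::mt19937_64 randomGenerator(rd());
    
        int64_t max_len = std::stoll(argv[1]);
        int64_t num_samples = std::stoll(argv[2]);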
    
        int64_t samplesRemaining = num_samples;
        Tree tree(max_len);
        int64_t poolSize = tree.size() - 1;
        std::deque<Range> eliminationList;
        std::deque<Range> eliminated;
        std::list<std::string> foundList;
    
        while (samplesRemaining > 0 && poolSize > 0) {
            // find a valid node
            int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator);
            NodeInfo info = tree.getNodeInfo(selectedNode);
            foundList.push_back(info.representation);
            samplesRemaining--;
    
            // determine which nodes this choice eliminates
            eliminated.clear();
            for( auto const& ancestor : info.ancestors) {
                Range r(ancestor, ancestor);
                if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                    eliminated.push_back(r);
                } else {
                    eliminated.back().merge(r);
                }
            }
            Range r(selectedNode, selectedNode + info.subTreeSize - 1);
            if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                eliminated.push_back(r);
            } else {
                eliminated.back().merge(r);
            }
    
            // add the eliminated nodes to the existing list
            poolSize -= mergeEliminations(eliminationList, eliminated);
        }
    
        // Print some stats
        // std::cout << "tree: " << tree.size() << " samplesRemaining: "
        //                       << samplesRemaining << " poolSize: "
        //                       << poolSize << " samples: " << foundList.size()
        //                       << " eliminated: "
        //                       << countEliminated(eliminationList) << std::endl;
    
        // Print list of binary strings
        // std::cout << "list:";
        // for (auto const& s : foundList) {
        //  std::cout << " " << s;
        // }
        // std::cout << std::endl;
    }
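
The selection step in `selectNode` is the part of the C++ program that avoids materialising the pool: a rank among the remaining nodes is shifted past every eliminated range that lies at or before it. The same idea as a short R sketch, where `select_node` is a hypothetical name and the eliminated ranges are assumed to be sorted, non-overlapping rows of a two-column matrix:

```r
# Map a rank among the remaining indices to the actual index by skipping
# every eliminated range that starts at or before the running position.
select_node <- function(rank, elim) {
  sel <- rank
  for (r in seq_len(nrow(elim))) {
    if (sel >= elim[r, 1]) {
      sel <- sel + (elim[r, 2] - elim[r, 1] + 1)  # jump over the whole range
    } else {
      break
    }
  }
  sel
}

# Indices 1..8 with ranges 2..3 and 6..6 eliminated leave 1, 4, 5, 7, 8.
elim <- rbind(c(2, 3), c(6, 6))
vapply(1:5, select_node, numeric(1), elim = elim)  # 1 4 5 7 8
```

Drawing `rank` uniformly from `1` to the number of remaining indices then gives a uniform pick among the nodes that are still allowed.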