Efficient stratified random sampling in Julia


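I am trying to write a small function that does stratified random sampling. That is, I have a vector giving the group membership of every element, and I want to select one element (index) from each of n groups. The input is therefore the number of elements wanted and the group membership of each element. The output is a list of indices.

Here is my function: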

function stratified_sample(n::Int64, groups::Array{Int64})

    # the output vector of indices
    ind = zeros(Int64, n)

    # first select n groups from the total set of possible groups
    group_samp = sample(unique(groups), n, replace = false)

    # cycle through the selected groups
    for i in 1:n
        # for each group, select one index whose group matches the current target group
        ind[i] = sample([1:length(groups)...][groups.==group_samp[i]], 1, replace = false)[1]
    end

    # return the indices
    return ind
end

When I run this code on a relatively large vector, for example 1000 distinct groups and 40000 entries in total, I get


julia> groups = sample(1:1000, 40000, replace = true)
40000-element Array{Int64,1}:
 221
 431
 222
 421
 714
 108
 751
 259
   ⋮
 199
 558
 317
 848
 271
 358

julia> @time stratified_sample(5, groups)
  0.022951 seconds (595.06 k allocations: 19.888 MiB)
5-element Array{Int64,1}:
 11590
 17057
 17529
 25103
 20651

Compare that with ordinary random sampling of five elements out of 40000:

julia> @time sample(1:40000, 5, replace = false)
  0.000005 seconds (5 allocations: 608 bytes)
5-element Array{Int64,1}:
 38959
  5850
  3283
 19779
 30063

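So my code runs roughly 4,600 times slower (0.022951 s versus 0.000005 s) and uses about 33k times more memory! What am I doing wrong, and is there a way to speed this code up? My guess is that the real slowdown happens in the subsetting step, i.e.

[1:length(groups)...][groups.==group_samp[i]]

but I cannot find a better solution.

I have searched the standard Julia packages for such a function again and again, without luck.

Any suggestions?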

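Edit: I can speed things up considerably by simply drawing a random sample and checking whether it satisfies the requirement of containing n unique groups: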

function stratified_sample_random(n::Int64, groups::Array{Int64}, group_probs::Array{Float32})
    ind = zeros(Int64, n)
    my_samp = []
    while true
        my_samp = wsample(1:length(groups), group_probs, n, replace = false)
        if length(unique(groups[my_samp])) == n
            break
        end
    end

    return my_samp

end

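Here, group_probs is simply a vector of sampling weights in which each element gets weight 1/s, where s is the number of elements in its group, so that every group has a total weight of 1. For example, if

groups = [1,1,1,1,2,3,3]

then the corresponding weights are

group_probs = [0.25,0.25,0.25,0.25,1,0.5,0.5]

This speeds up sampling by reducing the probability of picking more than one item from the same group. Overall, it works quite well: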

julia> @time stratified_sample_random(5, groups, group_probs)
  0.000122 seconds (14 allocations: 1.328 KiB)
5-element Array{Int64,1}:
 32209
 10184
 30892
  4861
 30300
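
The weights described above are not constructed anywhere in the post, so here is one way they might be built. This is a minimal sketch using only Base; group_weights is a hypothetical helper name, and it returns Float64 values, so they would need converting (for example with Float32.(...)) to match the Array{Float32} signature used above.

```julia
# Hypothetical helper (not part of the original post): give every element
# the weight 1/s, where s is the number of elements in that element's group.
function group_weights(groups::AbstractVector{<:Integer})
    counts = Dict{eltype(groups),Int}()
    for g in groups
        counts[g] = get(counts, g, 0) + 1   # tally group sizes in one pass
    end
    return [1 / counts[g] for g in groups]  # Vector{Float64}, one weight per element
end

w = group_weights([1, 1, 1, 1, 2, 3, 3])
```

With these weights every group carries the same total weight, regardless of how many elements it has.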

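After a bit of experimenting, probability-weighted sampling is not necessarily faster than the standard sample(); it depends on how many unique groups there are and on the desired value of n.

Of course, there is no guarantee that this function will ever draw a set of unique groups, so it could loop forever. My idea is to add a counter to the while loop and, if it has tried 10000 times without success, call the original stratified_sample function shown above, which is guaranteed to return a unique result. I do not like this solution, and there must be a more elegant and economical approach, but it is definitely an improvement.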
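
The counter idea could look roughly like the sketch below. Several things here are assumptions rather than the code from the post: the names stratified_sample_bounded, one_per_group and max_tries are made up, the retry loop uses plain uniform rand instead of wsample so that the snippet runs with Base and the Random standard library only, and one_per_group stands in for the original stratified_sample as the fallback.

```julia
using Random  # shuffle

# Hypothetical fallback: pick n distinct groups, then one random index in each.
# Throws a BoundsError if n exceeds the number of distinct groups.
function one_per_group(n::Integer, groups::AbstractVector{<:Integer})
    chosen = shuffle(unique(groups))[1:n]
    return [rand(findall(==(g), groups)) for g in chosen]
end

# Hypothetical wrapper: try rejection sampling at most max_tries times,
# then give up and call the fallback so the function always terminates.
function stratified_sample_bounded(n::Integer, groups::AbstractVector{<:Integer};
                                   max_tries::Integer = 10_000)
    for _ in 1:max_tries
        candidate = rand(1:length(groups), n)   # uniform draw, with replacement
        # n distinct groups also implies n distinct indices
        if allunique(groups[candidate])
            return candidate
        end
    end
    return one_per_group(n, groups)
end

groups = [1, 1, 1, 1, 2, 3, 3]
idx = stratified_sample_bounded(3, groups)
```

Note that the two paths are not guaranteed to sample from exactly the same distribution: the rejection loop draws indices uniformly, which favours larger groups, while the fallback chooses groups uniformly first.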

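Here, with [1:length(groups)...], you are needlessly allocating a length(groups)-element array n times, which you should avoid. Below is a version that is 33 times faster and uses the range inds instead. With more knowledge of the actual application, we could probably come up with an even faster approach.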

function stratified_sample(n::Int64, groups::Array{Int64})

    # the output vector of indices
    ind = zeros(Int64, n)

    # first select n groups from the total set of possible groups
    group_samp = sample(unique(groups), n, replace = false)

    inds = 1:length(groups)
    # cycle through the selected groups
    for i in 1:n
        # for each group, select one index whose group matches the current target group
        ind[i] = sample(inds[groups.==group_samp[i]], 1, replace = false)[1]
    end

    # return the indices
    return ind
end
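
Because only one index is needed per group, the same loop can also be written with Base's findall and rand instead of masking a range and calling sample. The following is a minimal sketch under assumptions: stratified_sample_findall is a made-up name, and shuffle(unique(groups))[1:n] is used as a Base-only stand-in for sample(unique(groups), n, replace = false) so that the snippet runs without StatsBase.

```julia
using Random  # shuffle

# Hypothetical variant of the function above (a sketch, not the answer's code).
function stratified_sample_findall(n::Integer, groups::AbstractVector{<:Integer})
    # pick n distinct groups; errors if n exceeds the number of distinct groups
    group_samp = shuffle(unique(groups))[1:n]
    # findall returns the indices of the matching elements directly,
    # and rand draws a single one of them uniformly
    return [rand(findall(==(g), groups)) for g in group_samp]
end

groups = [1, 1, 1, 1, 2, 3, 3]
idx = stratified_sample_findall(3, groups)
```

This still scans groups once for every selected group, so the cost grows with n times length(groups).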

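Alternatively, you could keep one element per group (maintaining only the group sample and the group elements visited so far) and run over the data once: O(N) time, O(groups) space.

Well, I may be mistaken. How about a reservoir for the n selected groups? That is, a single-element reservoir for each sampled group. Please have a look at the ML sampling functions.

Thanks for the suggestion to check the ML sampling functions. As far as I can tell, though, they do not allow sampling just one item per group, which is what I need here.

Also, sample(..., 1)[1] is probably equivalent to rand(...).

Thanks, that certainly speeds things up. I do not think the application offers much more to go on: I have a bunch of objects, some of which have only one type and some of which have several. I am drawing a set of bootstrap samples and want to apply a function to each random sample. I need the number of objects to be fixed, so each random sample may contain only one object of each type, and each bootstrap sample needs only n objects.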
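
The single-element reservoir idea from the comments can be sketched as follows. This is an illustration under assumptions, not the commenter's code: stratified_sample_reservoir is a hypothetical name, and the n groups are chosen with shuffle from the Random standard library rather than with StatsBase's sample.

```julia
using Random  # shuffle

# Hypothetical sketch: one pass over the data with a one-element reservoir
# for each of the n selected groups.
function stratified_sample_reservoir(n::Integer, groups::AbstractVector{<:Integer})
    # choose n distinct groups (errors if fewer than n distinct groups exist)
    chosen = shuffle(unique(groups))[1:n]
    slot = Dict(g => k for (k, g) in enumerate(chosen))  # group => output position
    seen = zeros(Int, n)   # members of each selected group visited so far
    ind = zeros(Int, n)    # current reservoir: one index per selected group
    for (i, g) in pairs(groups)
        k = get(slot, g, 0)
        k == 0 && continue            # element belongs to a group that was not selected
        seen[k] += 1
        # replace the stored index with probability 1/seen[k]; after the pass,
        # every member of the group is equally likely to be the one kept
        if rand(1:seen[k]) == 1
            ind[k] = i
        end
    end
    return ind
end

groups = [1, 1, 1, 1, 2, 3, 3]
idx = stratified_sample_reservoir(3, groups)
```

Apart from the call to unique, the data is traversed once, and the extra storage is proportional to the number of groups rather than to the number of elements.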