Memory-efficient sortperm on a matrix column in Julia

I have a large-ish matrix, and I want to apply sortperm to each column of it. The naive approach is

order = sortperm(X[:,j])
which makes a copy. That seems a shame, so I thought I would try a SubArray:

order = sortperm(sub(X,1:n,j))
but this is even slower. For a laugh I tried

order = sortperm(1:n,by=i->X[i,j])
but of course that is terrible. What is the fastest way to do this?

Here is some benchmark code:

getperm1(X,n,j) = sortperm(X[:,j])
getperm2(X,n,j) = sortperm(sub(X,1:n,j))
getperm3(X,n) = mapslices(sortperm, X, 1)
n = 1000000
X = rand(n, 10)
for f in [getperm1, getperm2]
    println(f)
    for it in 1:5
        gc()
        @time f(X,n,5)
    end
end
for f in [getperm3]
    println(f)
    for it in 1:5
        gc()
        @time getperm3(X,n)
    end
end
Results:

getperm1
elapsed time: 0.258576164 seconds (23247944 bytes allocated)
elapsed time: 0.141448346 seconds (16000208 bytes allocated)
elapsed time: 0.137306078 seconds (16000208 bytes allocated)
elapsed time: 0.137385171 seconds (16000208 bytes allocated)
elapsed time: 0.139137529 seconds (16000208 bytes allocated)
getperm2
elapsed time: 0.433251141 seconds (11832620 bytes allocated)
elapsed time: 0.33970986 seconds (8000624 bytes allocated)
elapsed time: 0.339840795 seconds (8000624 bytes allocated)
elapsed time: 0.342436716 seconds (8000624 bytes allocated)
elapsed time: 0.342867431 seconds (8000624 bytes allocated)
getperm3
elapsed time: 1.766020534 seconds (257397404 bytes allocated, 1.55% gc time)
elapsed time: 1.43763525 seconds (240007488 bytes allocated, 1.85% gc time)
elapsed time: 1.41373546 seconds (240007488 bytes allocated, 1.82% gc time)
elapsed time: 1.42215519 seconds (240007488 bytes allocated, 1.83% gc time)
elapsed time: 1.419174037 seconds (240007488 bytes allocated, 1.83% gc time)
where the mapslices version takes about 10 times as long as the getperm1 version, as you would expect, since it sorts all 10 columns.
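
For readers on Julia 1.x, several names used in the benchmark have since changed: `sub` was replaced by `view`, `gc()` by `GC.gc()`, and `mapslices` takes its dimension as the `dims` keyword. A rough sketch of the same three candidates, assuming Julia 1.x and using a smaller matrix and simplified signatures (both are illustrative, not the original benchmark):

```julia
# Assumed Julia 1.x spellings of the three candidates from the question.
getperm1(X, j) = sortperm(X[:, j])             # copies column j, then sorts
getperm2(X, j) = sortperm(view(X, :, j))       # no copy; `view` replaced `sub`
getperm3(X)    = mapslices(sortperm, X, dims=1) # one permutation per column

X = rand(1_000, 10)   # smaller than the original 1_000_000 x 10, for a quick run
GC.gc()               # `gc()` is now `GC.gc()`
@time getperm1(X, 5)
@time getperm2(X, 5)
@time getperm3(X)
```

The timings quoted in this thread come from a much older Julia release, so they should not be expected to carry over to this version.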


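It is worth pointing out that, at least on my machine, the copy+sortperm option is not much slower than a plain sortperm on a vector of the same length. The memory allocation is unnecessary, though, so it would be nice to avoid it.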
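
One way to cut down on allocation, sketched here under the assumption of Julia 1.x, is to allocate the permutation vector once and refill it with `sortperm!`, reading each column through a `view` instead of a copy. The buffer name `ix` is illustrative, and `sortperm!` may still allocate some internal scratch space depending on the Julia version, so this is worth measuring rather than assuming.

```julia
X = rand(1_000, 10)
ix = Vector{Int}(undef, size(X, 1))   # permutation buffer, allocated once

for j in axes(X, 2)
    # Overwrites ix with the sorting permutation of column j; the column is not copied.
    sortperm!(ix, view(X, :, j))
    # ix must be consumed here, because the next iteration reuses the buffer.
    @assert issorted(view(X, ix, j))
end
```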

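In a few very specific cases (such as taking a contiguous view of an Array), you can beat the performance of SubArray with some pointer magic: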

function colview(X::Matrix,j::Int)
    n = size(X,1)
    offset = 1+n*(j-1) # The linear start position
    checkbounds(X, offset+n-1)
    pointer_to_array(pointer(X, offset), (n,))
end

getperm4(X,n,j) = sortperm(colview(X,j))
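The function colview returns a full-fledged Array that shares its data with the original X. Note that this is a terrible idea, because the returned array references data that Julia tracks only through X. This means that if X goes out of scope before the column "view" does, data access will crash with a segfault.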
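
On Julia 1.x, `pointer_to_array` no longer exists; `unsafe_wrap` is its replacement. The following is a minimal sketch of the same trick under that assumption, with `GC.@preserve` used to keep X rooted while the wrapper is in use. The names `colview_unsafe` and `getperm4_preserved` are illustrative and not from the original answer.

```julia
# Wrap column j of X as a Vector without copying (Julia 1.x assumed).
# The returned array does NOT keep X alive, so the caller must do so.
function colview_unsafe(X::Matrix{T}, j::Integer) where {T}
    n = size(X, 1)
    checkbounds(X, 1:n, j)             # throws if column j does not exist
    offset = 1 + n * (j - 1)           # linear index of X[1, j]
    unsafe_wrap(Array, pointer(X, offset), (n,))
end

# GC.@preserve guarantees X is not collected while the wrapper is being read.
getperm4_preserved(X, j) = GC.@preserve X sortperm(colview_unsafe(X, j))
```

The same lifetime caveat applies: the wrapper is only safe inside the `GC.@preserve` block and should not be returned or stored.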

The results:

getperm1
elapsed time: 0.317923176 seconds (15 MB allocated)
elapsed time: 0.252215996 seconds (15 MB allocated)
elapsed time: 0.215124686 seconds (15 MB allocated)
elapsed time: 0.210062109 seconds (15 MB allocated)
elapsed time: 0.213339974 seconds (15 MB allocated)
getperm2
elapsed time: 0.509172302 seconds (7 MB allocated)
elapsed time: 0.509961218 seconds (7 MB allocated)
elapsed time: 0.506399583 seconds (7 MB allocated)
elapsed time: 0.512562736 seconds (7 MB allocated)
elapsed time: 0.506199265 seconds (7 MB allocated)
getperm4
elapsed time: 0.225968056 seconds (7 MB allocated)
elapsed time: 0.220587707 seconds (7 MB allocated)
elapsed time: 0.219854355 seconds (7 MB allocated)
elapsed time: 0.226289377 seconds (7 MB allocated)
elapsed time: 0.220391515 seconds (7 MB allocated)

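I haven't dug into why SubArray performance is worse, but it may simply come from an extra pointer dereference on every memory access. It is remarkable how little the allocation actually costs in terms of time: getperm1's timings are more variable, but it still occasionally outperforms getperm4! I think this is due to some extra pointer math in Array's internal implementation when the data is shared. There is also some crazy caching behavior: getperm1 gets significantly faster on repeated runs.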

How does mapslices(sortperm,X,1) perform? Have you tried @time on the sortperm variants you have so far?