Julia: Why doesn't shared-memory multithreading give me a speedup?
I want to use shared-memory multithreading in Julia. As the Threads.@threads macro does, I can use ccall(:jl_threading_run…) to do this. My code now runs in parallel, but I am not getting the speedup I expected.
The code below is intended as a minimal example of the approach I am taking and the performance problem I am seeing: [EDIT: see the simpler example further down]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single-threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multithreading slows things down! Why?
EDIT: A better minimal example can be built with the @threads macro itself:
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
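I had wrongly assumed that the inclusion of the @threads macro in Julia meant it would bring a benefit.
The problem you are hitting is most likely false sharing: each thread writes to its own array element, but those elements sit next to each other in memory and share a cache line, so the cores keep invalidating each other's cached copies.
You can fix this by spacing the locations you write to far enough apart, as in the function f(spacing) further down (a "quick and dirty" implementation that shows the essence of the change), or by accumulating into a thread-local variable, as in the getrange and f() pair that follows it (this is the preferred approach, since it should be faster).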
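Also note that you are probably running 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gain from threading will not be linear.
These are indeed faster. Strangely, if your first solution is taken out of the f() wrapper and run with @time in front of the for loop, it is much slower.
That is caused by accessing non-const global variables. In general, in Julia you should always wrap your code in functions (Julia allows other styles, but they may cost you performance).
@Matt: Thanks. So now it is not so "quick and dirty".
@BogumiłKamiński: Thanks! This is a great example. I initially thought your diagnosis was wrong, downvoted you, and started writing a competing answer... only to find that I had made a silly mistake, and by then my downvote was locked in by the 15-minute timer. The simplest way to "fix" this is for you to edit your answer slightly so that I can change my vote.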
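The remark above about non-const globals can be illustrated with a small sketch. The names `data`, `sum_from_global`, and `sum_from_arg` are made up for this illustration and do not appear in the thread; the exact timings will vary by machine and Julia version, so none are quoted here.

```julia
# Illustrative sketch; names are hypothetical, not from the original thread.
data = rand(10^6)            # non-const global: its type could change at any time

# Reads the global directly, so the compiler cannot assume `data` is a Vector{Float64}.
function sum_from_global()
    s = 0.0
    for x in data
        s += x
    end
    return s
end

# Same loop, but the array arrives as an argument, so the method is
# specialized on the concrete argument type.
function sum_from_arg(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end

sum_from_global(); sum_from_arg(data)   # warm up so compilation is not timed
@time sum_from_global()
@time sum_from_arg(data)
```

On a typical setup the first call should report many allocations and the second almost none, which is the same effect as moving the threaded loop in and out of the f() wrapper.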
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)
julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)
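One way to see why a spacing of 8 helps is to count how many cache lines the per-thread slots fall on. The sketch below assumes 64-byte cache lines (common on x86-64, but not stated in the thread) and an array whose data starts on a line boundary; `lines_touched` is a hypothetical helper written for this illustration.

```julia
# Assumption: 64-byte cache lines and array data that starts on a line boundary.
# Returns how many distinct cache lines the per-thread slots a[tid * spacing] fall on.
function lines_touched(spacing, nthreads; line_bytes = 64, elem_bytes = sizeof(Float64))
    # Byte offset of element a[tid * spacing] from the start of the array data.
    byte_offsets = [(tid * spacing - 1) * elem_bytes for tid in 1:nthreads]
    return length(unique(div.(byte_offsets, line_bytes)))
end

println(lines_touched(1, 4))   # spacing = 1: all four slots share a single line
println(lines_touched(8, 4))   # spacing = 8: each slot lands on its own line
```

With 8-byte Float64 elements, a spacing of 8 puts consecutive slots exactly 64 bytes apart, so under these assumptions no two threads write to the same line.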
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end
function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
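The code above picks each thread's range from Threads.threadid(), which relies on iteration k running on thread k. As far as I know, newer Julia releases schedule @threads dynamically by default, so that mapping is not guaranteed there. A variant that indexes by the loop variable instead might look like the sketch below; `chunk_range` and `threaded_sum` are hypothetical names, and the function sums each element once rather than repeating the inner 1:10 loop.

```julia
# Hypothetical variant: the chunk is chosen by the loop index k, not by threadid().
function chunk_range(n, k, nchunks)
    d, r = divrem(n, nchunks)
    from = (k - 1) * d + min(r, k - 1) + 1
    to = from + d - 1 + (k <= r ? 1 : 0)
    return from:to
end

function threaded_sum(b)
    nchunks = Threads.nthreads()
    partial = zeros(nchunks)
    Threads.@threads for k in 1:nchunks
        local_sum = 0.0                      # thread-local accumulator, no shared writes
        for i in chunk_range(length(b), k, nchunks)
            @inbounds local_sum += b[i]
        end
        partial[k] = local_sum               # one write per chunk at the very end
    end
    return sum(partial)
end

b = rand(10^6)
println(threaded_sum(b) ≈ sum(b))
```

Each task writes to the shared array only once, so any false sharing on `partial` is limited to a handful of stores rather than one per element.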