Quick Optimizations in Julia for Performance: A Practical Example
January 19 2016 in Julia, MATLAB | Tags: AVX512, julia, performance, SIMD, threading | Author: Christopher Rackauckas
Let’s take a program which plots the standard logistic map:
r = 2.9:.00005:4; numAttract = 100;
steady = ones(length(r),1)*.25;
for i=1:300 ## Get to steady state
  steady = r.*steady.*(1-steady);
end
x = zeros(length(steady),numAttract);
x[:,1] = steady;
for i=2:numAttract ## Now grab some values at the attractor
  x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
end
using PyPlot
fig = figure(figsize=(20,10));
plot(collect(r),x,"b.",markersize=.06)
savefig("plot.png",dpi=300);
This plots the logistic map. If you take the same code and change the array indexing from A[i,j] to A(i,j), you get MATLAB code. However, I couldn't get MATLAB's plot to look as good, but that's beside the point.
What I want to show are the steps for optimizing this code. In MATLAB, this is as vectorized as it gets, so you're pretty much done. In Julia, there are a few more steps you can take. The first thing to notice is that if this code is run directly, all the variables are in global scope, which hurts performance, so the first change is to wrap the code in a function. Next, we note that we can improve performance with AVX2 SIMD instructions, which let the processor operate on multiple numbers in a single instruction. Think of it as operating on vectors of 8 numbers at a time; when AVX-512 comes out, you will be able to do 16 at a time. This is something you'll want to take advantage of! Vectorized operations already have this implemented, so to get it in our loops we simply add @simd before our for loops. Be careful when using @simd: it allows the loop iterations to be re-ordered. Since each simulation (different r's) is independent, we are fine here. More about this can be found here.
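As a quick aside, here is a minimal sketch of the global-scope effect in isolation (the helper function iterate_map and the loop counts are illustrative, and the global keyword is what recent Julia versions require for assigning to a global inside a top-level loop):

s = 0.25
@time for i = 1:10^7 # all globals: every iteration is dynamically typed
  global s = 3.5 * s * (1 - s)
end

function iterate_map(s, n)
  for i = 1:n # all locals: types are inferred, so this compiles to a tight loop
    s = 3.5 * s * (1 - s)
  end
  return s
end
@time iterate_map(0.25, 10^7)

The second timing is typically orders of magnitude faster, which is why wrapping scripts in functions is the first optimization step.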
Applying these changes to our script gives the following code:
function logisticPlot()
  tic()
  r = 2.9:.00005:4; numAttract = 100;
  steady = ones(length(r),1)*.25;
  @simd for i=1:400 ## Get to steady state
    @inbounds @fastmath steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()
which does better than before. Normally we could go even further by explicitly defining the types, which lets LLVM know the data types and apply compiler optimizations as if it were C code, but here we can't squeeze out more that way.
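For illustration, "explicitly defining the types" would look something like the sketch below (logistic_step is a made-up helper; and note that, as the comments below point out, annotations like these don't actually speed up code that is already inside a function, since the types are inferred automatically):

function logistic_step(r::Vector{Float64}, x::Vector{Float64})::Vector{Float64}
  # The annotations document the expected types; inference already knows them,
  # so the generated machine code is the same with or without them.
  return r .* x .* (1 .- x)
end

logistic_step(fill(3.5, 10), fill(0.25, 10))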
This is pretty much it, because we cannot use @parallel: we have to step through time in order, without skipping steps, and @parallel would re-order the timesteps. So, all in all, how well did we do? The final code takes .15 seconds on my machine to run (without the plotting). However, MATLAB comes in at .05. Why did MATLAB win? If you check the timings in more detail, you'll see that it all comes down to the inner loop steps still taking longer than in MATLAB, and there is a clear reason for it: Julia does not have multithreading yet. If you check
A = rand(20000,20000); B = rand(20000,20000); A.*B
you will see that this uses all of the cores on your machine, whereas the exact same code in Julia does not (if you do A*B instead, Julia will use all cores, because that call is handed off to the multi-threaded BLAS library). So there it is: MATLAB wins because it's using all 16 cores on my workstation, but Julia on 1 core holds up to it and makes a prettier plot. With this in mind, I think Julia will win out in this example once it's multi-threaded, but we'll see.
What I really wanted to highlight here is that Julia is flexible enough to make both the prototype and the final code. You can code using simple MATLAB-like syntax to get things done, and then, when you have a working product, start adding macros, parallelization, and type specifications to get something that compiles and runs like C. In MATLAB, you write MATLAB code, and when you get that working, you re-write all the heavy-lifting parts in C and link them up via MEX. That is quite a pain to do, and it is itself why I am investing time in Julia, knowing it will pay off well.
Note: Julia's threading branch has already been merged into the main branch. They are just waiting for it to stabilize a bit before putting it in a standard release (with the major multi-threaded operations such as .* utilizing it). So stay tuned.
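For a taste of what that enables, here is a sketch of an explicitly threaded version of the computation (the function name attractor_threaded is illustrative; Julia must be started with multiple threads, e.g. by setting the JULIA_NUM_THREADS environment variable):

using Base.Threads

function attractor_threaded(r, numAttract)
  x = Matrix{Float64}(undef, length(r), numAttract)
  @threads for i in eachindex(r) # each r value is an independent trajectory
    s = 0.25
    for _ = 1:400 # burn in to the attractor
      s = r[i] * s * (1 - s)
    end
    x[i, 1] = s
    for j = 2:numAttract # record values on the attractor
      x[i, j] = r[i] * x[i, j-1] * (1 - x[i, j-1])
    end
  end
  return x
end

x = attractor_threaded(2.9:.00005:4, 100)

This is safe to thread because the re-ordering happens across the independent r's, not across the timesteps of any one trajectory.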
Edits
I am coming back to this after having learned some more Julia. As okvs pointed out (and as I go into in depth in a different blog post), it can be beneficial to devectorize, since doing so reduces the number of temporary arrays. Okvs's comment below has a good devectorized version of the code. I want to point out that, if you'd still like to keep the elegance of the vectorized code, you can use the Devectorize.jl package like so:
using Devectorize ## provides @devec; must be loaded before the function definition
function logisticPlot()
  tic()
  r = 2.9:.00005:4; numAttract = 100;
  steady = ones(length(r),1)*.25;
  for i=1:400 ## Get to steady state
    @devec steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()
This will not match the performance of the fully devectorized version: @devec won't work on the second loop because it's more complicated, and I don't have it doing SIMD vectorization on the devectorized loop (two different kinds of "vectorization"...). So this isn't the best you can do, but it still gets you pretty far for very little work.
Neil says:
Outstanding—thank you!
Neil says:
I found your post looking for ways to realize Julia's promised speed—thanks for posting! I'm sure things have changed since your post in 2016, but here it is 2022 and I think new Julia users (like me) are running into a lot of the same problems in trying to realize Julia's promised performance.
I ran your original code (without the plotting) and okvs’s function ten times each in fresh Julia 1.7 REPLs. Although the initial runs and average run time are definitely slower for the original, and there is more memory allocated, after ten runs through the same code the running times for the two versions are pretty similar.
I saw the same pattern when I tried to follow similar suggestions (in particular, to devectorize) to optimize my own code: memory allocation decreased, but overall speed was not noticeably different. Do you have any updated thoughts on this issue?
***
Details:
@time begin
r = 2.9:.00005:4;
numAttract = 100;
steady = ones(length(r), 1)*.25;
for i = 1:300 ## Get to steady state
steady = r.*steady.*(1 .- steady);
end
x = zeros(length(steady),numAttract);
x[:, 1] = steady;
for i = 2:numAttract ## Now grab some values at the attractor
x[:, i] = r .* x[:, i - 1] .* (1 .- x[:, i - 1]);
end
end
0.614575 seconds (1.40 M allocations: 190.435 MiB, 4.77% gc time, 84.39% compilation time)
0.158689 seconds (4.00 k allocations: 117.531 MiB, 57.36% gc time)
0.104670 seconds (4.00 k allocations: 117.531 MiB, 56.47% gc time)
0.053089 seconds (4.00 k allocations: 117.531 MiB, 4.37% gc time)
0.180362 seconds (4.00 k allocations: 117.531 MiB, 72.59% gc time)
0.051584 seconds (4.00 k allocations: 117.531 MiB, 4.45% gc time)
0.057686 seconds (4.00 k allocations: 117.531 MiB, 4.00% gc time)
0.245907 seconds (4.00 k allocations: 117.531 MiB, 71.26% gc time)
0.051242 seconds (4.00 k allocations: 117.531 MiB, 4.41% gc time)
0.052163 seconds (4.00 k allocations: 117.531 MiB, 4.51% gc time)
function logisticPlot_()
r = 2.9:.00005:4
numAttract = 100
steady = similar(r)
x = Array{Float64}(undef, length(steady), numAttract)
for (i, ri) in enumerate(r)
si = 0.25
for _ = 1:400 ## Get to steady state
si = ri * si * (1 - si)
end
steady[i] = x[i, 1] = si
end
@inbounds for j in 2:size(x, 2), i in 1:size(x, 1) ## Now grab some values at the attractor
xij = x[i, j-1]
x[i, j] = r[i] * xij * (1 - xij)
end
return (r, x)
end
@time logisticPlot_();
0.044854 seconds (4 allocations: 16.953 MiB)
0.051890 seconds (4 allocations: 16.953 MiB, 11.30% gc time)
0.069667 seconds (4 allocations: 16.953 MiB)
0.078508 seconds (4 allocations: 16.953 MiB)
0.055579 seconds (4 allocations: 16.953 MiB, 14.16% gc time)
0.045339 seconds (4 allocations: 16.953 MiB)
0.052361 seconds (4 allocations: 16.953 MiB)
0.101776 seconds (4 allocations: 16.953 MiB, 48.54% gc time)
0.093141 seconds (4 allocations: 16.953 MiB)
0.060678 seconds (4 allocations: 16.953 MiB)
Christopher Rackauckas says:
This blog post is really old. It's from 2016 and has not been updated. You don't need to devectorize to get performance and reduce allocations; just make sure to use views and .=. Here's a non-allocating version:
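A minimal sketch of what such a views-and-.= version looks like (the name logistic_inplace and the exact structure here are illustrative):

function logistic_inplace()
  r = 2.9:.00005:4
  numAttract = 100
  x = Matrix{Float64}(undef, length(r), numAttract) # the only real allocation
  steady = @view x[:, 1] # a view: no copy of the column
  steady .= 0.25
  for _ = 1:400 ## Get to steady state, updating in place
    steady .= r .* steady .* (1 .- steady)
  end
  for i = 2:numAttract ## Fill each column from the previous one, in place
    prev = @view x[:, i-1]
    x[:, i] .= r .* prev .* (1 .- prev)
  end
  return (r, x)
end

@time (r, x) = logistic_inplace();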
0.063746 seconds (6 allocations: 17.121 MiB)
0.055894 seconds (6 allocations: 17.121 MiB)
0.062280 seconds (6 allocations: 17.121 MiB)
on a really slow laptop. You may want to instead check out the Optimizing Serial Code lecture of the SciML book, or the corresponding YouTube video, which goes into detail on these things.
okvs says:
A couple of remarks:
1. You will not speed up anything by adding type annotations. That is a common misconception. The types are automatically inferred.
2. You are losing a lot of time on implicitly defining temporary arrays through the use of vectorized operations. If you write the code like this:
function logisticPlot_()
r = 2.9:.00005:4
numAttract = 100
steady = similar(r)
x = Array{Float64}(length(steady), numAttract)
for (i, ri) in enumerate(r)
si = 0.25
for _ = 1:400 ## Get to steady state
si = ri * si * (1 - si)
end
steady[i] = x[i, 1] = si
end
@inbounds for j in 2:size(x, 2), i in 1:size(x, 1) ## Now grab some values at the attractor
xij = x[i, j-1]
x[i, j] = r[i] * xij * (1 - xij)
end
return (r, x)
end
you mostly avoid temporary arrays. On my computer I see these execution times:
> @time (r, x) = logisticPlot();
0.231214 seconds (23.06 k allocations: 302.368 MB, 16.72% gc time)
> @time (r, x) = logisticPlot_(); # new method
0.039998 seconds (15 allocations: 16.954 MB)
Notice the difference in allocations. I haven't used @simd and @fastmath since they do not seem to work on my computer, but you might try adding them on top of this.
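For reference, a sketch of layering them onto the second loop of logisticPlot_ (the two-variable for has to be split, since @simd only accepts a single innermost loop; any gains will depend on the CPU and Julia version):

@inbounds for j in 2:size(x, 2)
  @simd for i in 1:size(x, 1) # innermost loop is a plain unit-stride range
    xij = x[i, j-1]
    x[i, j] = @fastmath r[i] * xij * (1 - xij)
  end
end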
yaakov says:
+1
Wouldn't it be nice if those promoting Julia demonstrated the full set of features available? And when it comes to optimizing code, @benchmark and @profile could be included.
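For what it's worth, a minimal sketch of both tools applied to the function from the comment above (assumes the BenchmarkTools.jl package is installed; Profile ships with Julia):

using BenchmarkTools
@benchmark logisticPlot_() # many timed samples plus allocation statistics

using Profile
@profile logisticPlot_() # sample the call stack while the function runs
Profile.print() # show where the time actually went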
Christopher Rackauckas says:
Good remarks. It's been a while since I wrote this, and I've learned a lot since then. A few remarks in return:
1) Type annotations do help if you're working in the global namespace, i.e. typing directly into the REPL. If you wrap the code in a function, then they don't help. I didn't know that before, and I should clean this up.
2) Yes, I just checked too, and devectorization helps here. In fact, you can use Devectorize.jl's @devec macro to automatically devectorize the first loop. It can't handle the second one, though.
I’ll update this sometime soon.
Srikiran says:
What about parfor in MATLAB? I am not familiar with Julia and @simd, but it feels like it's only fair to utilise the same functionality in MATLAB as well!
Christopher Rackauckas says:
Parfor does not work for this problem. When you use parfor, the iterations have to be solvable out of order in the looping variable. It's clear in the second loop that, since x[i+1] requires x[i], a parfor would give incorrect results: when going out of order, it will look at an x[i] that doesn't have a value yet (i.e. zero) and use that. A similar thing happens in the first loop: say you have 4 processors doing the computation to update steady; because they all pull the same past value of steady, only one of them effectively updates it, so you would finish the loop faster but would actually have done a quarter of the steps.
In fact, if it were possible to parallelize this problem via parfor, Julia has a similar construct, @parallel. But for the same reasons as with parfor, we cannot use it here. SIMD is very different: the diagram on this page explains what @simd does, and as you can see, it is out of order, but that ordering works for this problem.
Besides, MATLAB is already parallelized here, since .* is multi-threaded; sometimes it's not beneficial to use parfor on operations like this even if you could.
Hope that makes sense.