Optimal Number of Workers for Parallel Julia
April 16 2016 in HPC, Julia, Programming, Stochastics | Tags: BLAS, hyperthreading, julia, parallel computing, workers | Author: Christopher Rackauckas
How many workers do you choose when running a parallel job in Julia? The answer is easy, right? The number of physical cores. We always default to that number. For my Core i7 4770K, that means 4, not 8, since 8 would include the hyperthreads. On my FX8350 there are 8 cores, but only 4 floating-point units (FPUs) which do the math, so for mathematical projects I should use 4, right? I want to demonstrate that it’s not that simple.
Where the Intuition Comes From
Most of the time when doing scientific computing you are doing parallel programming without even knowing it. This is because many vectorized operations are “implicitly parallelized”, meaning that they are multi-threaded behind the scenes to make everything faster. This is also the case in other languages like Python, MATLAB, and R. Fire up MATLAB and run
```matlab
A = randn(10000,10000);
B = randn(10000,10000);
A.*B;
```
and you will see that all of your cores are used. Threads are a recent introduction to Julia, and so in version 0.5 this will also be the case.
Another large place where implicit parallelization comes up is in linear algebra. When you do a matrix multiplication, it is almost surely calling an underlying implementation of BLAS. BLAS (Basic Linear Algebra Subprograms) is aptly named: it is just a set of functions for basic linear algebra operations. These are written in C or Fortran and are heavily optimized. They are well-studied, and many smart people have meticulously crafted “perfect code” which minimizes cache misses and all of that other low-level stuff, all to make this very common operation run smoothly.
Because BLAS (and LAPACK, the Linear Algebra PACKage, for higher-level linear algebra routines) is so optimized, people say you should always make sure that it knows exactly how many “real” processors it has to work with. So in my case, with a Core i7 that has 4 physical cores plus 4 from hyperthreading, forget the hyperthreading: use 4. With the FX8350, there are only 4 FPUs for doing the math, so again 4 threads. But check to make sure this really is best.
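For example, you can set the BLAS thread count directly from Julia and time a matrix multiply to see the effect. This is a minimal sketch using the current API (`LinearAlgebra.BLAS.set_num_threads`); the Julia 0.4/0.5 versions this post was written against exposed the same functionality under a slightly different name.

```julia
using LinearAlgebra   # BLAS lives in this standard library on Julia >= 0.7

A = randn(10000, 10000)
B = randn(10000, 10000)

BLAS.set_num_threads(4)   # pin BLAS to the 4 physical cores discussed above
@time A * B;              # the multiply dispatches to the multithreaded BLAS
```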
What about for your code?
Most likely this does not apply to your code. You didn’t carefully manage all of your allocations and tell the compiler what needs to be cached etc. You just wrote some beautiful Julia code. So how many workers do you choose?
Let’s take my case. I have 4 real cores; do I choose 4? Or do I make 3 workers so that 1 core is free to “command” the others? Or do I go to 7 or 8 because of hyperthreading?
I decided to test this out on a non-trivial example similar to what’s shown in this benchmark (I forget exactly which SDE it was because that notebook got deleted, but it was some simple SDE). I used a high-order adaptive solver for stochastic differential equations to solve the same problem 1000 times. This is interesting because a stochastic model is probabilistic, so doing this gives the approximate distribution of the result. The code sets up the problem and then calls pmap to do a Monte Carlo simulation, solving the equation 1000 times in parallel. The code is mostly math, but there is a slight twist where some values are stored on stacks (a very lightweight data structure). To make sure I could trust the times, I ran the code 1000 times and took the average, min, and max times.
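The adaptive solver itself isn’t shown in this post, but the harness looks roughly like the sketch below. `solve_sde`, its drift `f`, diffusion `g`, and the step sizes are all hypothetical stand-ins (a plain Euler-Maruyama step, as mentioned in the comments); the point is the pmap-over-1000-runs structure.

```julia
using Distributed   # a standard library on Julia >= 0.7; on 0.4/0.5 this was in Base
addprocs(15)        # spin up however many workers are being tested

# Hypothetical stand-in for the adaptive SDE solver: a fixed-step Euler-Maruyama run
@everywhere function solve_sde(; u0 = 0.5, dt = 1e-4, T = 1.0)
    f(u) = 1.01u    # drift (illustrative choice)
    g(u) = 0.3u     # diffusion (illustrative choice)
    u = u0
    for _ in 1:round(Int, T / dt)
        u += f(u) * dt + g(u) * sqrt(dt) * randn()
    end
    return u
end

# Monte Carlo: solve the SDE 1000 times in parallel and collect the end states
results = pmap(i -> solve_sde(), 1:1000)
```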
So in this case, what was best? The results speak for themselves.
| Number of Workers | Average Wall Time | Max Wall Time | Min Wall Time |
|---|---|---|---|
| 1 | 62.8732 | 64.3445 | 61.4971 |
| 3 | 25.749 | 26.6989 | 25.1143 |
| 4 | 22.4782 | 23.2046 | 21.8322 |
| 7 | 19.7411 | 20.2904 | 19.1305 |
| 8 | 19.0709 | 20.1682 | 18.5846 |
| 9 | 18.3677 | 18.9592 | 17.6 |
| 10 | 18.1857 | 18.9801 | 17.6823 |
| 11 | 18.1267 | 18.7089 | 17.5099 |
| 12 | 17.9848 | 18.5083 | 17.5529 |
| 13 | 17.8873 | 18.4358 | 17.3664 |
| 14 | 17.4543 | 17.9513 | 16.9258 |
| 15 | 16.5952 | 17.2566 | 16.1435 |
| 16 | 17.5426 | 18.4232 | 16.2633 |
| 17 | 16.927 | 17.5298 | 16.4492 |
Note there are two “1000”s here: I ran the Monte Carlo simulation (each run solving the SDE 1000 times itself) 1000 times, and the table reports the mean, max, and min times it took. From these results it’s very clear that the minimum is somewhere around 15 workers. 15!
What’s going on? My guess is that this is because of the time that’s not spent on the actual mathematics: evaluating branching logic, allocating new memory as the stacks grow bigger, and so on. Although it is a math problem, there is more to this problem than just the math! Thus it seems the scheduler can let the processes compete and more fully utilize the CPU by pushing the jobs around. This can only go so far: if you have too many workers, then you start to get cache misses and the computational time starts to increase. Indeed, at 10 workers I could already see signs of problems in the resource manager.
However, at low levels of over-subscription, letting one process re-allocate memory (or stall on a cache miss, or whatever it happens to be doing) while another keeps the CPU busy seems to be a good tradeoff. Thus, for this code, the optimal number of workers is far above the number of physical cores.
Moral of the Story
The moral is: test your code. If your code is REALLY efficient, then sure, making sure you don’t mess with your perfect code is optimal. If your code isn’t optimal (i.e. it’s just some Julia code that is pretty good and you want to parallelize it), try some higher numbers of workers. You may be shocked by what happens. In this case, the compute time dropped by roughly 26% relative to 4 workers (and over 35% relative to 3) just by overloading the number of workers.
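If you want to run this kind of test on your own code, a minimal sketch of a worker-count sweep is below. `run_simulation` is a hypothetical placeholder for whatever you would pass to pmap, and the particular worker and repetition counts are only illustrative.

```julia
using Distributed, Statistics   # both standard libraries on Julia >= 0.7

addprocs(17)   # add the largest pool you plan to test up front

# Hypothetical placeholder workload; replace with your own pmap-able function
@everywhere run_simulation(i) = sum(abs2, randn(10^6))

# Time a pmap over the first n workers, repeated `reps` times
function time_with_workers(n; reps = 10)
    pool = WorkerPool(workers()[1:n])
    times = [(@elapsed pmap(run_simulation, pool, 1:1000)) for _ in 1:reps]
    return mean(times), maximum(times), minimum(times)
end

for n in (1, 4, 8, 15)
    avg, tmax, tmin = time_with_workers(n)
    println("workers = $n: avg = $avg, max = $tmax, min = $tmin")
end
```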
anon
says: It is not surprising that hyperthreading helps in imperfect code. I am very surprised that you profit from more threads than logical cores (i.e. 15 > 8 = 4 * 2). Is your code using a lot of syncing? That is, if one thread/process needs to wait on a different one before it can continue, then it is of course very helpful to have more threads.
But your original problem description looked embarrassingly parallel (optimally you would use 1e6 cores for a linear speedup of 1e6 for computation).
Christopher Rackauckas
says: Yes, the original problem is embarrassingly parallel. The point is that worrying about things like cache efficiency assumes that your algorithm is actually optimized enough to be using every cycle perfectly. If you’re talking about BLAS, sure. But if you write something yourself that’s more than a single summation loop, then it could be more efficient to push a few more computations in there, and the only way to find out is to test.
Ricardo Cruz
says: I have also noticed this and posted about it on Stack Overflow.
As you can see, I was confused. I thought at first it had to do with my operating system not avoiding cache misses, so there was no trade-off. But I guess if I did some more proper empirical work such as yours, I would see something like what you noticed.
Jiahao Chen
says: Amdahl’s Law applies to every parallel program. However, the proportion of the program it estimates as not parallelizable need not be limited to the lines of code that you write specifically.
Conventional wisdom applies to statically compiled languages with manually managed memory, where you don’t have to worry about so much run time overhead. The fact that your code may look “seemingly” parallel but doesn’t exhibit perfect weak scaling is measuring the non-parallelizable overhead of everything you are using, including the Julia runtime and the specifics of how we implement pmap at present.
Christopher Rackauckas
says: Yes, we agree.
Jiahao Chen
says: Congratulations! You have just demonstrated Amdahl’s Law. Amdahl would have interpreted your results as saying that about 75% of your code can be run in parallel and 25% can only be run in serial.
Without any further details on your code, yes, you are probably measuring some overhead associated with the Julia runtime, such as the gc. But it is also likely that some part of your code is intrinsically not parallelizable, for example the overhead associated with scheduling, polling, synchronization and serialization in pmap.
Christopher Rackauckas
says: I can’t go into detail on the specific adaptive SDE solver code until it’s published, but you can get similar (though not as big) results using simple Euler-Maruyama code. Since it’s just solving a bunch of SDEs in Monte Carlo, Euler-Maruyama just independently, M different times, runs “for i=1:numberofsteps; u = u + f(u)*dt + g(u)*sqrt(dt)*randn(); end”. The whole point is that Monte Carlo simulations are obviously “embarrassingly parallel” cases where you would just use pmap to run this 10000 times and return a vector of u at the end. Mathematically, this is “perfectly parallel”, but there’s enough “computer stuff” in the background (the stuff you mention) that pushing the number of workers above the number of cores can have a positive effect on the runtime.
I mention this because the common wisdom is always “choose the number of workers to match the number of cores”. Try it on your own code, even the stuff that is almost 100% parallel. I find that a few extra workers usually helps.
(Also, Amdahl’s Law doesn’t really apply here, since in the sense of Amdahl the program is “seemingly” 100% parallel (as in, the code is in the pmap part for almost 100% of the time, just doing math in a loop). For this kind of thing, if you double the cores, you get almost precisely half the runtime because you’re just running twice as many simulations at the same time. Here I’m not even changing the underlying resources; I am adding more workers than the number of cores, essentially just playing with how it’s scheduled rather than letting it use more processors.)