I’m running channelflow on a cluster with 28 compute nodes. Each node has two 12-core Xeon E5-2680 CPUs.
I envision two scales of channelflow computations:
- small, running on one node with 24 cores, 128 x 129 x 128 discretization
- large, running on, say, sixteen nodes with 384 cores, 1024 x 129 x 1024 discretization
What are decent starting points for the values of -np0 and -np1 for these two simulations?
Hello John,
there is only one hard requirement for the MPI distribution, which is mod(Nx, np0) = 0. The reason is the way the FFTW transform plans are set up. If you do not respect this, you will see a runtime error. In your case, any power of 2 will work for np0.
I typically use a distribution which is close to equally distributed, because it makes the data chunks most likely to be the same size on each process (“pencil distribution”). In your case, this would be (np0, np1) = (4, 6) for small and (np0, np1) = (16, 24) for large. However, it might be worth testing whether a “slab distribution” is faster in your case. This means that only one dimension is distributed, say (np0, np1) = (1, 384) for large. The advantage is that you save communication cost. The disadvantage is that it is more likely to distribute the data unequally over the tasks. np1 divides the x-dimension in physical space and the z-dimension in spectral space. If your numerical domain is very long but narrow, a “slab distribution” is probably a bad idea, but your domain has equal sides, so it might be more performant.
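If it helps to make the bookkeeping explicit, here is a small Python sketch (nothing channelflow-specific; `candidate_grids` is just an illustrative helper) that lists all (np0, np1) pairs satisfying mod(Nx, np0) = 0 for a given core count, and picks out the slab and the most-square pencil candidates for your two cases:

```python
# Sketch only: enumerate candidate (np0, np1) process grids for a given
# total core count, keeping only those that satisfy the hard requirement
# mod(Nx, np0) == 0 discussed above.

def candidate_grids(nx, ncores):
    """Return all (np0, np1) pairs with np0 * np1 == ncores and nx % np0 == 0."""
    return [(np0, ncores // np0)
            for np0 in range(1, ncores + 1)
            if ncores % np0 == 0 and nx % np0 == 0]

if __name__ == "__main__":
    # Grid sizes and core counts taken from the two cases in the question.
    for label, nx, ncores in [("small", 128, 24), ("large", 1024, 384)]:
        pairs = candidate_grids(nx, ncores)
        # (1, ncores) is the slab distribution; the pair closest to a square
        # process grid is the pencil distribution discussed above.
        pencil = min(pairs, key=lambda p: abs(p[0] - p[1]))
        print(f"{label}: slab = (1, {ncores}), pencil = {pencil}, all = {pairs}")
```

For the small case this picks (4, 6) as the most-square pencil grid and for the large case (16, 24), i.e. the starting points suggested above; the full list gives you the other admissible grids to benchmark against the slab distribution.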
On the computer and system that we use (an IBM x3750), we find that (np0, np1) = (1, np) is optimal for our (Lx, Lz) = (10, 40) domains, at least up to np = 64.
Laurette: I see the same on my Intel Xeon E5-2680 cluster. (np0, np1) = (1, np) is optimal for all np I’ve tested, up to 64.