
I'm running my program on a cluster. Each node has 2 GPUs, and each MPI task calls a CUDA function.

My question is: if two MPI processes run on each node, will the two CUDA function calls be scheduled on different GPUs, or will they both run on the same one? And what happens if I run 4 MPI tasks on each node?

Each MPI task calls one CUDA function, and that function runs on whichever GPU you choose. You select the GPU with cudaSetDevice(): since each of your nodes has 2 GPUs, you can switch between them with cudaSetDevice(0) and cudaSetDevice(1), typically deriving the device id from the MPI task rank (e.g. cudaSetDevice(rank % 2)). If you don't call cudaSetDevice() at all, I believe both MPI tasks will run their CUDA functions on the default GPU (device 0), one after the other. Likewise, if you run 3 or more MPI tasks per node, at least two of them will share a GPU, and their CUDA functions will contend for it and effectively run serially.

Are you sure both CUDA functions use the same (default) device? I would have assumed it picks the fastest unused device, which with 2 GPUs would force both to be used. – arbitUser1401, Apr 29, 2012 at 14:23

I just tested it myself: no matter how many MPI tasks I created, all of them used device 0 for the CUDA calls. I don't think there is any mechanism that distributes the load, since each MPI task is independent; you have to do it manually with cudaSetDevice(). – chemeng, Apr 29, 2012 at 14:50

MPI and CUDA are basically orthogonal: you will have to manage MPI process-to-GPU affinity explicitly yourself. In practice this makes compute-exclusive mode pretty much mandatory on each GPU. You can then use a split communicator with coloring to enforce process-to-GPU affinity, once each process has found a free device on which it can establish a context.

Massimo Fatica from NVIDIA posted a useful code snippet on the NVIDIA forums a while ago that might get you started.

It's worth noting that in thread- or process-exclusive compute mode it isn't always necessary to manage GPUs explicitly: quite often just letting the driver choose a device will do the trick. – aland, Apr 29, 2012 at 19:49
