python - How does torch.distributed.barrier() work

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I've read all the documentations I could find about torch.distributed.barrier(), but still having trouble understanding how it's being used in this script and would really appreciate some help.

So the official doc of torch.distributed.barrier says it "Synchronizes all processes.This collective blocks processes until the whole group enters this function, if async_op is False, or if async work handle is called on wait()."

It's used in two places in the script:

First place

    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
        ... (preprocesses the data and save the preprocessed data)
    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier() 
Second place
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
        ... (loads the model and the vocabulary)
    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
I'm having trouble relating the comment in the code to the functionality of this function stated in the official doc. How does it make sure only the first process executes the code between the two calls of torch.distributed.barrier() and why it only checks whether the local rank is 0 before the second call? 
Thanks in advance!
First you need to understand the ranks. To be brief: in a multiprocessing context we typically assume that rank 0 is the first process or base process. The other processes are then ranked differently, e.g. 1, 2, 3, totalling four processes in total.
Some operations are not necessary to be done in parallel or you just need one process to do some preprocessing or caching so that the other processes can use that data.
In your example, if the first if statement is entered by the non-base processes (rank 1, 2, 3), they will block (or "wait") because they run into the barrier. They wait there, because barrier() blocks until all processes have reached a barrier, but the base process has not reached a barrier yet.
So at this point the non-base processes (1, 2, 3) are blocked, but the base process (0) continues. The base process will do some operations (preprocess and cache data, in this case) until it reaches the second if-statement. There, the base process will run into a barrier. At this point, all processes have stopped at a barrier, meaning that all current barriers can be lifted and all processes can continue. Because the base process prepared the data, the other processes can now use that data.
Perhaps the most important thing to understand is:
when a process encounters a barrier it will block
the position of the barrier is not important (not all processes have to enter the same if-statement, for instance)
a process is blocked by a barrier until all processes have encountered a barrier, upon which those barriers are lifted for all processes
                Thanks for clarifying: "a process is blocked by a barrier until all processes have encountered a barrier, upon which the barrier is lifted for all processes" makes sense
– vgoklani
                Jan 23, 2020 at 21:44
                Thanks for your helpful explanation. But I am not clear about the second point. If non-base processes are waiting at barrier, even the base process won't get into the if code and when it pass the if condition, do you mean all processes will continue?
– Qinsheng Zhang
                Feb 16, 2021 at 15:53
                It is quite strange, the base process does not execute barrier(), how can other process get notified that all processes are ready so that the barrier can be removed?
– Qinsheng Zhang
                Feb 16, 2021 at 15:55
                @QinshengZhang The processes do not have to enter the same barrier, just a barrier. So if you write an if-function that only some processes enter, then they will only continue if the other processes reach another barrier.
– Bram Vanroy
                Feb 17, 2021 at 8:59
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.