I'm working on a SLURM cluster, running several processes at the same time (on several input files) using the same bash script.
At the end of the job the process was killed, and this is the error I got:
slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.
My guess is that there is some issue with memory, but how can I find out more about it?
Did I not request enough memory, or did I, as a user, request more than I have access to?
Any suggestions?
The accepted answer is correct but, to be more precise, the error
slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.
indicates that you are low on CPU RAM (the host memory managed by Linux).
If you were, for instance, running a computation on a GPU and requested more GPU memory than is available, you would see a different error, e.g. for PyTorch:
RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)
Check out the explanation in this article for more details.
Solution: increase the value of the --mem-per-cpu parameter in your script, or add it if it is not set.
If you are using sbatch (sbatch your_script.sh) to run your script, add the following line to it:
#SBATCH --mem-per-cpu=<value bigger than you've requested before>
If you are using srun (srun python3 your_script.py), add the parameter like this:
srun --mem-per-cpu=<value bigger than you've requested before> python3 your_script.py
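For example, a complete batch script might look like the sketch below. The job name, resource values, and script/input names are placeholders to adapt to your cluster and workload, not part of the original answer:
#!/bin/bash
#SBATCH --job-name=my_job            # placeholder job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G             # raise this value if the step gets oom-killed again
#SBATCH --time=01:00:00
# placeholder command; run whatever your job actually does
srun python3 your_script.py input_file.txt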
Here OOM stands for "Out of Memory". When Linux runs low on memory, it will "oom-kill" a process to keep critical processes running. It looks like slurmstepd
detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.
If you had requested more memory than you were allowed, the process would not have been allocated to a node and computation would not have started. It looks like you need to request more memory.
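To see how much memory the step actually used compared to what you requested, SLURM's accounting tools can report it. A sketch, assuming accounting (sacct) is enabled on your cluster; substitute your own job ID:
sacct -j 1090990 --format=JobID,State,ReqMem,MaxRSS,Elapsed
MaxRSS is the peak memory each step actually used; if it is close to or above ReqMem, raise the request with --mem or --mem-per-cpu.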
I had missed the scheduler="processes" parameter:
mean_squared_errors = dask.compute(*delayed_results, scheduler="processes")
I'm also on a SLURM cluster, and fixing this oversight fixed my issue.
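For context, here is a minimal, self-contained sketch of that dask pattern; the mean_squared_error function and the toy data are made-up placeholders, not from the original code:
import dask
from dask import delayed

# hypothetical per-task computation; stands in for whatever each delayed task does
@delayed
def mean_squared_error(pairs):
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # one lazy task per input, each a small list of (prediction, target) pairs
    delayed_results = [mean_squared_error([(i, 0.0), (i + 1.0, 1.0)]) for i in range(4)]

    # compute all tasks at once; scheduler="processes" selects dask's
    # multiprocessing scheduler instead of the default threaded one
    mean_squared_errors = dask.compute(*delayed_results, scheduler="processes")
    print(mean_squared_errors)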