Dear all,
I have introduced an update to the job scheduling policy on the Mjolnir cluster to improve overall fairness and resource utilization.
Reason for this change
I have observed an increasing number of cases where entire compute nodes are being occupied by single jobs that do not fully utilize the allocated CPUs.
This leads to inefficient use of resources and longer wait times for other users. The new limits are intended to prevent this situation.
What has changed?
Jobs submitted under the default QOS (“normal”) are now limited to:
• Maximum 48 CPUs per job
• Maximum 48 CPUs per node
This ensures that no single job or user can occupy an entire compute node.
What does this mean for you?
If you currently submit jobs requesting more than 48 CPUs (e.g. via --cpus-per-task, --ntasks, or similar settings), these jobs will no longer start.
If a job exceeds this limit, you may see it remain in the queue with a reason such as:
QOSMaxCpuPerJobLimit
If this occurs, please cancel the job, reduce the number of requested CPUs to 48 or fewer, and resubmit it.
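As a minimal sketch, assuming the standard squeue client is available on your PATH, the following lists your pending jobs that are held back by this limit. The helper name blocked_jobs is hypothetical and only used for illustration.

```shell
# Hypothetical helper: reads "jobid reason" lines on stdin and prints the
# IDs of jobs whose pending reason is the QOS per-job CPU limit.
blocked_jobs() {
  awk '$2 == "QOSMaxCpuPerJobLimit" { print $1 }'
}

# Only query the scheduler when squeue is actually installed.
# -h: no header, -t PENDING: pending jobs only, %i: job ID, %r: reason.
if command -v squeue >/dev/null 2>&1; then
  squeue -h -u "$USER" -t PENDING -o "%i %r" | blocked_jobs
fi
```

Each job ID printed can then be cancelled with scancel followed by that ID, before you resubmit with a smaller CPU request.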
If you rely on existing or regularly used job scripts, please review and update them to comply with the new limits.
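To help with reviewing existing scripts, here is a rough sketch of a check that estimates how many CPUs a job script requests. It is an illustration under simplifying assumptions: the function name requested_cpus is hypothetical, it only understands the long "--option=value" form on #SBATCH lines, and it estimates the total as ntasks × cpus-per-task. Short options (-n, -c), --ntasks-per-node, and options passed on the sbatch command line are not handled.

```shell
# Hypothetical helper: print the estimated number of CPUs requested by a
# job script, computed as ntasks * cpus-per-task (each defaulting to 1).
requested_cpus() {
  awk '
    /^#SBATCH/ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^--ntasks=/)        { split($i, a, "="); ntasks = a[2] }
        if ($i ~ /^--cpus-per-task=/) { split($i, a, "="); cpt = a[2] }
      }
    }
    END {
      total = (ntasks ? ntasks : 1) * (cpt ? cpt : 1)
      print total
    }
  ' "$1"
}

# Example usage (my_job.sh is a placeholder file name):
#   requested_cpus my_job.sh
```

If the printed number is above 48, the script needs to be adjusted before it is submitted under the "normal" QOS.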
Use of nodes (important)
For most workloads on Mjolnir, you should explicitly request:
#SBATCH --nodes=1
This ensures that your job runs on a single node.
If --nodes=1 is not specified, Slurm may distribute your job across multiple nodes to fulfill the requested resources (CPUs and memory), even if your application is not designed for it.
Many common tools and pipelines (e.g. standard bioinformatics tools, R, Python, etc.) are not designed to run efficiently across multiple nodes.
Requesting multiple nodes without using distributed computing frameworks (such as MPI) can lead to:
• Jobs that do not run correctly
• Significant performance degradation
• Inefficient use of cluster resources, which effectively blocks them from other users
If your job does not explicitly support multi-node execution, requesting more than one node will not provide any benefit and may negatively impact both your job and others in the queue.
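For a typical multi-threaded, single-node workload, a job script along the following lines stays within the new limits. This is a sketch, not a template endorsed for every case: the job name, memory, time limit, CPU count, and the tool invocation are all placeholder assumptions that you should replace with values suited to your workload.

```shell
#!/bin/bash
#SBATCH --job-name=example       # placeholder name
#SBATCH --nodes=1                # keep the whole job on a single node
#SBATCH --ntasks=1               # one process ...
#SBATCH --cpus-per-task=16       # ... with 16 threads (must be 48 or fewer)
#SBATCH --mem=32G                # placeholder memory request
#SBATCH --time=04:00:00          # placeholder time limit

# Slurm sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${THREADS} threads"

# Pass the same number to your tool so it does not start more threads
# than were allocated (my_tool and its flag are hypothetical):
# my_tool --threads "${THREADS}" input.dat
```

Passing the allocated CPU count to the application keeps the number of threads in line with what Slurm has actually reserved for the job.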
A note on performance
Requesting more CPUs does not necessarily result in faster runtimes. Many applications do not scale efficiently beyond a certain number of cores.
Requesting excessive CPUs can therefore lead to:
• Limited or no performance improvement
• Idle CPU resources
• Increased queue times for yourself and others
I recommend testing your workload with a few different CPU counts to determine the optimal number for it.
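One way to judge whether a finished job made good use of its CPUs is to compare the CPU time it actually consumed with the CPU time it reserved: efficiency = used CPU time / (elapsed time × allocated CPUs). The sketch below only does this arithmetic; the function name cpu_efficiency is hypothetical. The inputs can be read from sacct (for example with --format=JobID,Elapsed,TotalCPU,AllocCPUS), keeping in mind that sacct prints Elapsed and TotalCPU as time strings that you need to convert to seconds first.

```shell
# Hypothetical helper: cpu_efficiency USED_CPU_SECONDS ELAPSED_SECONDS ALLOC_CPUS
# Prints the percentage of the reserved CPU time that was actually used.
cpu_efficiency() {
  LC_ALL=C awk -v used="$1" -v elapsed="$2" -v cpus="$3" \
    'BEGIN { printf "%.1f\n", 100 * used / (elapsed * cpus) }'
}

# Example: 2 hours of wall time on 16 CPUs, but only 4 CPU-hours consumed.
cpu_efficiency 14400 7200 16   # prints 12.5
```

A low percentage, as in this example, suggests that the job would run about as fast with fewer CPUs.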
I apologize for any inconvenience this change may cause.
If you have workloads that genuinely require more resources, you are welcome to contact me to discuss possible solutions.
Best regards,
Bent