SA2C Tech Chats
2020.04.28

Brief overview
of the SLURM
job scheduling algorithm

Michele Mesiti

Outline

  • When will my job start?
  • Why?
    • The Multifactor Priority Plugin (Job Priority Formula)
      • Age
      • Quality of Service (QOS)
      • Partition
      • Size
      • Fairshare
      • ...
    • Tools!
    • Backfill Scheduling

When will my job start?


$ squeue --start

Why?

  • Slurm Default: FIFO
  • With the Multifactor Priority Plugin: ordered by priority
  • But if a job can be scheduled without delaying the start of any other jobs with higher priority, the backfill scheduler may launch it first

The Multifactor Priority Plugin

Job_priority = (PriorityWeightAge * age_factor) +
(PriorityWeightQOS * QOS_factor) +
(PriorityWeightPartition * partition_factor) +
(PriorityWeightJobSize * job_size_factor) +
(PriorityWeightFairshare * fair-share_factor) +
(PriorityWeightAssoc * assoc_factor) +
SUM(TRES_weight_<type> * TRES_factor_<type>, ...)
- nice_factor + site_factor
  • All the factors are floating-point numbers between 0.0 and 1.0
  • All the weights are positive integers (they should be large, so that small differences between factors still produce distinct integer priorities)
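As a toy illustration, the formula above is just a weighted sum of normalised factors. The weights and factor values below are invented, not a real cluster configuration:

```python
# Toy version of the multifactor priority formula above.
# All weights and factor values below are invented, not a real configuration.

def job_priority(weights, factors, nice_factor=0, site_factor=0):
    """Weighted sum of normalised factors, minus nice, plus site factor."""
    total = sum(weights[name] * factors[name] for name in weights)
    return total - nice_factor + site_factor

# Hypothetical weights (large integers) and factors (between 0.0 and 1.0):
weights = {"age": 1000, "qos": 10000, "partition": 1000,
           "job_size": 500, "fairshare": 100000}
factors = {"age": 0.5, "qos": 1.0, "partition": 0.5,
           "job_size": 0.25, "fairshare": 0.5}
print(job_priority(weights, factors))  # → 61125.0
```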

The Multifactor Priority Plugin



We can see the values of the weights in
/etc/slurm/slurm.conf

or with
sprio -w

AGE FACTOR

This is the factor that makes your job more likely to start the longer it waits in the queue.

It will increase linearly over time, reaching 1.0 at PriorityMaxAge

(currently, PriorityMaxAge = 7 days on Sunbird, and PriorityWeightAge = 1000)

The time spent in the queue while the job cannot run, because of dependencies or because it is held, does not count.
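The linear growth can be sketched as follows, using the Sunbird values quoted above (PriorityMaxAge = 7 days, PriorityWeightAge = 1000):

```python
# Sketch of the age factor's linear growth, using the Sunbird values from
# the slide: PriorityMaxAge = 7 days, PriorityWeightAge = 1000.
PRIORITY_MAX_AGE_DAYS = 7
PRIORITY_WEIGHT_AGE = 1000

def age_priority(days_eligible):
    # The factor grows linearly with eligible queue time, capped at 1.0.
    factor = min(days_eligible / PRIORITY_MAX_AGE_DAYS, 1.0)
    return PRIORITY_WEIGHT_AGE * factor

print(age_priority(3.5))   # half of PriorityMaxAge → 500.0
print(age_priority(10.0))  # past PriorityMaxAge, capped → 1000.0
```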

QOS FACTOR

Each QOS has a priority factor associated with it.

These can be visualised with
sacctmgr show qos

The priorities shown are normalised to the highest one.

PARTITION FACTOR

Each partition has a priority factor associated with it.

These can be visualised with
scontrol show partitions

Notice that every partition has an associated QOS that will carry its own priority factor.

SIZE FACTOR

Larger jobs get a priority increase proportional to their size (number of nodes requested).

The factor is equal to 1.0 for a job that requests all the nodes on the machine.
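A minimal sketch of this factor; the 100-node cluster below is a made-up example:

```python
# Sketch of the job-size factor: the fraction of the machine's nodes a job
# requests. The 100-node cluster below is a made-up example.
TOTAL_NODES = 100

def job_size_factor(nodes_requested, total_nodes=TOTAL_NODES):
    return nodes_requested / total_nodes

print(job_size_factor(100))  # whole machine → 1.0
print(job_size_factor(1))    # single node → 0.01
```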

FAIR SHARE FACTOR

This is the complex one.

Goal: prioritize jobs from recently under-serviced accounts, and de-prioritize jobs from recently over-serviced accounts.

Two algorithms: Classic Fair Share and Fair Tree.

FAIR SHARE FACTOR - Associations

What is an association?

From the documentation:

The association is a combination of cluster, account, user names and optional partition name.

We can look at all of them with
sshare
or
sacctmgr show associations

FAIR SHARE FACTOR - Account Tree


From the SLURM online documentation

FAIR SHARE FACTOR - Algorithm - 1/3

  • Set rank=user_assoc_count

Then:
  1. Calculate Level Fairshare for the subtree children
  2. Sort children of the subtree
  3. Visit the children in descending order
    • If user, assign a final fairshare factor similar to
      (rank--/user_assoc_count)
    • If account, descend to account

(From the docs)
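The traversal above can be sketched as a toy recursion. The tree, the Level Fairshare (lf) values and the account/user names below are all invented:

```python
# Toy sketch of the Fair Tree traversal described above. The tree, the
# Level Fairshare (lf) values and the account/user names are invented.

def fair_tree(node, state):
    # Steps 1-2: children's Level Fairshare is assumed precomputed; sort
    # them, best first. Step 3: visit them in descending order.
    for child in sorted(node["children"], key=lambda c: c["lf"], reverse=True):
        if "children" in child:                      # an account: descend
            fair_tree(child, state)
        else:                                        # a user association:
            # assign rank-- / user_assoc_count as the final factor
            state["factors"][child["name"]] = state["rank"] / state["n"]
            state["rank"] -= 1

users = [{"name": "alice", "lf": 2.0}, {"name": "bob", "lf": 0.5}]
root = {"children": [{"lf": 1.0, "children": users}]}
state = {"rank": 2, "n": 2, "factors": {}}  # rank starts at user_assoc_count
fair_tree(root, state)
print(state["factors"])  # alice, with the higher LF, gets the higher factor
```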

FAIR SHARE FACTOR - Algorithm - 2/3

The Level Fairshare (LF) is computed as:

LF = S / U

where

S = S_raw[self] / S_raw[self+siblings]

U = U_raw[self] / U_raw[self+siblings]

Where Sraw represents the shares assigned to the association, while Uraw represents the resource usage...

(From the docs)
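A small numeric sketch of this computation; the shares and usage numbers below are invented:

```python
# Numeric sketch of the Level Fairshare at one level of the tree.
# The shares and usage numbers below are invented.

def level_fairshare(shares_self, shares_siblings, usage_self, usage_siblings):
    S = shares_self / (shares_self + shares_siblings)  # normalised shares
    U = usage_self / (usage_self + usage_siblings)     # normalised usage
    return S / U

# An association that holds 1/4 of the shares but consumed 1/2 of the usage
# is over-serviced, so its Level Fairshare drops below 1:
print(level_fairshare(10, 30, 50, 50))  # 0.25 / 0.5 → 0.5
```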

FAIR SHARE FACTOR - Algorithm - 3/3

The Resource usage is actually computed using a decaying exponential:

U_H = U_t + D * U_(t-1) + D^2 * U_(t-2) + ...

Where D is such that it gives a half-life as set by the PriorityDecayHalfLife variable.

(From the docs)

(Or, at least, this is true for the Classic Fair Share algorithm; there is no clear mention of this on the Fair Tree algorithm page.)
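The decaying sum can be sketched as follows; the per-period usage values and the one-period half-life are invented:

```python
# Sketch of the half-life decayed usage sum. The per-period usage values and
# the one-period half-life below are invented.

def decayed_usage(usage_per_period, half_life_periods):
    D = 0.5 ** (1.0 / half_life_periods)  # chosen so that D**half_life == 0.5
    # usage_per_period[0] is the most recent period, weighted by D**0 == 1
    return sum((D ** k) * u for k, u in enumerate(usage_per_period))

# Constant usage of 100, half-life of one period: 100 + 50 + 25 = 175
print(decayed_usage([100, 100, 100], 1))  # → 175.0
```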

Some Tools


  • sprio -l: look at the terms in the priority formula
  • sshare -l: look at fair-share-related quantities

Backfill Scheduling

  • A quick scheduling attempt is made at certain events (job submission/completion, configuration changes, ...)
  • "Backfill scheduling is a time consuming operation."
  • It is governed by the SchedulerParameters options in slurm.conf
  • Backfill scheduling will start lower-priority jobs if doing so does not delay the expected start time of any higher-priority jobs.

(From the docs and this presentation)
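The backfill condition can be sketched with a toy single-reservation model; the times below are invented:

```python
# Toy sketch of the backfill condition: a lower-priority job may start in a
# gap only if it finishes before the higher-priority job's reserved start.
# The single-reservation model and all times below are invented.

def can_backfill(candidate_time_limit, reserved_start, now):
    """True if the candidate job would end before the reservation begins."""
    return now + candidate_time_limit <= reserved_start

now = 0
reserved_start = 10  # the highest-priority job must wait for nodes until t=10
print(can_backfill(5, reserved_start, now))   # True: fits in the gap
print(can_backfill(20, reserved_start, now))  # False: would delay the big job
```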

Thank you for your attention!