SA2C Tech Chats
2020.04.28

Brief overview
of the SLURM
job scheduling algorithm

Michele Mesiti

Outline

  • When will my job start?
  • Why?
    • The Multifactor Priority Plugin (Job Priority Formula)
      • Age
      • Quality of Service (QOS)
      • Partition
      • Size
      • Fairshare
      • ...
    • Tools!
    • Backfill Scheduling

When will my job start?


$ squeue --start

Why?

  • Slurm Default: FIFO
  • With the Multifactor Priority Plugin: ordered by priority
  • But if a job can be scheduled without delaying the start of any other jobs with higher priority, the backfill scheduler may launch it first

The Multifactor Priority Plugin

Job_priority = (PriorityWeightAge * age_factor) +
(PriorityWeightQOS * QOS_factor) +
(PriorityWeightPartition * partition_factor) +
(PriorityWeightJobSize * job_size_factor) +
(PriorityWeightFairshare * fair-share_factor) +
(PriorityWeightAssoc * assoc_factor) +
SUM(TRES_weight_<type> * TRES_factor_<type>, ...)
- nice_factor + site_factor
  • All the factors are floating-point numbers between 0.0 and 1.0
  • All the weights are positive integers (they should be large, so that small differences between factors still produce distinct integer priorities)
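As a toy illustration, the formula above is just a weighted sum of normalised factors. The weights and factor values below are invented, not a real cluster configuration:

```python
# Toy version of the multifactor priority formula above.
# All weights and factor values below are invented, not a real configuration.

def job_priority(weights, factors, nice_factor=0, site_factor=0):
    """Weighted sum of normalised factors, minus nice, plus site factor."""
    total = sum(weights[name] * factors[name] for name in weights)
    return total - nice_factor + site_factor

# Hypothetical weights (large integers) and factors (between 0.0 and 1.0):
weights = {"age": 1000, "qos": 10000, "partition": 1000,
           "job_size": 500, "fairshare": 100000}
factors = {"age": 0.5, "qos": 1.0, "partition": 0.5,
           "job_size": 0.25, "fairshare": 0.5}
print(job_priority(weights, factors))  # → 61125.0
```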

The Multifactor Priority Plugin



We can see the values of the weights in
/etc/slurm/slurm.conf

or with
sprio -w

AGE FACTOR

This is the factor that makes your job more likely to start the longer it waits in the queue.

It will increase linearly over time, reaching 1.0 at PriorityMaxAge

(currently, PriorityMaxAge = 7 days on Sunbird, and PriorityWeightAge = 1000)

The time spent in the queue while the job cannot run, because of dependencies or because it is held, does not count.
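The linear growth can be sketched as follows, using the Sunbird values quoted above (PriorityMaxAge = 7 days, PriorityWeightAge = 1000):

```python
# Sketch of the age factor's linear growth, using the Sunbird values from
# the slide: PriorityMaxAge = 7 days, PriorityWeightAge = 1000.
PRIORITY_MAX_AGE_DAYS = 7
PRIORITY_WEIGHT_AGE = 1000

def age_priority(days_eligible):
    # The factor grows linearly with eligible queue time, capped at 1.0.
    factor = min(days_eligible / PRIORITY_MAX_AGE_DAYS, 1.0)
    return PRIORITY_WEIGHT_AGE * factor

print(age_priority(3.5))   # half of PriorityMaxAge → 500.0
print(age_priority(10.0))  # past PriorityMaxAge, capped → 1000.0
```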

QOS FACTOR

Each QOS has a priority factor associated with it.

These can be visualised with
sacctmgr show qos

The priorities shown are normalised to the highest one.

PARTITION FACTOR

Each partition has a priority factor associated with it.

These can be visualised with
scontrol show partitions

Notice that every partition has an associated QOS that will carry its own priority factor.

SIZE FACTOR

Larger jobs get a priority increase proportional to their size (number of nodes requested).

The factor is equal to 1.0 for a job that requests all the nodes on the machine.
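A minimal sketch of this factor; the 100-node cluster below is a made-up example:

```python
# Sketch of the job-size factor: the fraction of the machine's nodes a job
# requests. The 100-node cluster below is a made-up example.
TOTAL_NODES = 100

def job_size_factor(nodes_requested, total_nodes=TOTAL_NODES):
    return nodes_requested / total_nodes

print(job_size_factor(100))  # whole machine → 1.0
print(job_size_factor(1))    # single node → 0.01
```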

FAIR SHARE FACTOR

This is the complex one.

Goal: prioritize jobs from recently under-serviced accounts, and de-prioritize jobs from recently over-serviced accounts.

Two algorithms: Classic Fair Share and Fair Tree.

FAIR SHARE FACTOR - Associations

What is an association?

From the documentation:

The association is a combination of cluster, account, user names and optional partition name.

We can look at all of them with
sshare
or
sacctmgr show associations

FAIR SHARE FACTOR - Account Tree


From the SLURM online documentation

FAIR SHARE FACTOR - Algorithm - 1/3

  • Set rank=user_assoc_count

Then:
  1. Calculate Level Fairshare for the subtree children
  2. Sort children of the subtree
  3. Visit the children in descending order
    • If user, assign a final fairshare factor similar to
      (rank--/user_assoc_count)
    • If account, descend to account

(From the docs)
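The traversal above can be sketched as a toy recursion. The tree, the Level Fairshare (lf) values and the account/user names below are all invented:

```python
# Toy sketch of the Fair Tree traversal described above. The tree, the
# Level Fairshare (lf) values and the account/user names are invented.

def fair_tree(node, state):
    # Steps 1-2: children's Level Fairshare is assumed precomputed; sort
    # them, best first. Step 3: visit them in descending order.
    for child in sorted(node["children"], key=lambda c: c["lf"], reverse=True):
        if "children" in child:                      # an account: descend
            fair_tree(child, state)
        else:                                        # a user association:
            # assign rank-- / user_assoc_count as the final factor
            state["factors"][child["name"]] = state["rank"] / state["n"]
            state["rank"] -= 1

users = [{"name": "alice", "lf": 2.0}, {"name": "bob", "lf": 0.5}]
root = {"children": [{"lf": 1.0, "children": users}]}
state = {"rank": 2, "n": 2, "factors": {}}  # rank starts at user_assoc_count
fair_tree(root, state)
print(state["factors"])  # alice, with the higher LF, gets the higher factor
```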

FAIR SHARE FACTOR - Algorithm - 2/3

The Level Fairshare (LF) is computed as:

LF = S / U

where

S = S_raw[self] / S_raw[self+siblings]

U = U_raw[self] / U_raw[self+siblings]

Where Sraw represents the shares assigned to the association, while Uraw represents the resource usage...

(From the docs)
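A small numeric sketch of this computation; the shares and usage numbers below are invented:

```python
# Numeric sketch of the Level Fairshare at one level of the tree.
# The shares and usage numbers below are invented.

def level_fairshare(shares_self, shares_siblings, usage_self, usage_siblings):
    S = shares_self / (shares_self + shares_siblings)  # normalised shares
    U = usage_self / (usage_self + usage_siblings)     # normalised usage
    return S / U

# An association that holds 1/4 of the shares but consumed 1/2 of the usage
# is over-serviced, so its Level Fairshare drops below 1:
print(level_fairshare(10, 30, 50, 50))  # 0.25 / 0.5 → 0.5
```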

FAIR SHARE FACTOR - Algorithm - 3/3

The Resource usage is actually computed using a decaying exponential:

U_H = U_t + D * U_(t-1) + D^2 * U_(t-2) + ...

Where D is such that it gives a half-life as set by the PriorityDecayHalfLife variable.

(From the docs)

(Or, at least, this is true for the Classic Fair Share algorithm; there is no clear mention of this on the Fair Tree algorithm page.)
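The decaying sum can be sketched as follows; the per-period usage values and the one-period half-life are invented:

```python
# Sketch of the half-life decayed usage sum. The per-period usage values and
# the one-period half-life below are invented.

def decayed_usage(usage_per_period, half_life_periods):
    D = 0.5 ** (1.0 / half_life_periods)  # chosen so that D**half_life == 0.5
    # usage_per_period[0] is the most recent period, weighted by D**0 == 1
    return sum((D ** k) * u for k, u in enumerate(usage_per_period))

# Constant usage of 100, half-life of one period: 100 + 50 + 25 = 175
print(decayed_usage([100, 100, 100], 1))  # → 175.0
```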

Some Tools


  • sprio -l: look at the terms in the priority formula
  • sshare -l: look at fair-share-related quantities

Backfill Scheduling

  • A quick scheduling attempt is made at certain events (job submission/completion, configuration changes, ...)
  • "Backfill scheduling is a time consuming operation."
  • It is governed by the SchedulerParameters options in slurm.conf
  • Backfill scheduling will start lower-priority jobs if doing so does not delay the expected start time of any higher-priority jobs.

(From the docs and this presentation)
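The backfill condition can be sketched with a toy single-reservation model; the times below are invented:

```python
# Toy sketch of the backfill condition: a lower-priority job may start in a
# gap only if it finishes before the higher-priority job's reserved start.
# The single-reservation model and all times below are invented.

def can_backfill(candidate_time_limit, reserved_start, now):
    """True if the candidate job would end before the reservation begins."""
    return now + candidate_time_limit <= reserved_start

now = 0
reserved_start = 10  # the highest-priority job must wait for nodes until t=10
print(can_backfill(5, reserved_start, now))   # True: fits in the gap
print(can_backfill(20, reserved_start, now))  # False: would delay the big job
```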

Thank you for your attention!