Reproducibility problems in performance measurements

Consistent measurement results are essential for sound judgement and action during performance optimization.

Common sources of irreproducibility

If you repeat a performance measurement and notice unexpected variation, consider the following aspects:

Non-exhaustive list of sources of irreproducibility and their mitigations

| Class | Cause | Mitigation/Solution | Notes |
|---|---|---|---|
| Software | Software versions (dependencies) | Proper dependency tracking, with pinned versions (.lock files, Manifest.toml) under version control | |
| Software | Compiler flags | Automation of builds, build scripts under VC | (also for Julia) |
| Hardware | CPU frequency variations | likwid-setFrequencies, --cpu-freq SLURM flag, system monitoring (CC, JM), MachineState | factor of 4 observed |
| Hardware | Microcode updates | Track microcode versions (cat /proc/cpuinfo) | (Rare) |
| Multithreading | Thread migration (process migration) | thread/process pinning with OMP/MPI (likwid-pin, ThreadPinning.jl) | |
| Multithreading | Dynamic thread scheduling | Use static scheduling instead (if reasonable) | dynamic scheduling necessary for load balancing (at times) |
| “Noisy Neighbour” | node sharing | use --exclusive sbatch allocation | |
| “Noisy Neighbour” | shared filesystem congestion | Isolate, manage and monitor I/O | |
| “Noisy Neighbour” | Network congestion | topology control via sbatch | |
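
Several of the hardware and multithreading items above (microcode versions, CPU frequency variations, thread migration) can be tracked by storing a small machine snapshot next to every timing. The following is a minimal Python sketch of that idea, not a replacement for likwid or MachineState. It assumes a Linux node whose /proc/cpuinfo exposes `microcode` and `cpu MHz` fields (typical on x86). The function names `parse_cpuinfo`, `machine_snapshot` and `timed_runs` are hypothetical helpers introduced for illustration only.

```python
import os
import time


def parse_cpuinfo(text):
    """Extract microcode revisions and per-core clocks (MHz) from /proc/cpuinfo text."""
    microcode = set()
    mhz = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "microcode":
            microcode.add(value)
        elif key == "cpu MHz":
            try:
                mhz.append(float(value))
            except ValueError:
                pass  # ignore values that are not plain numbers
    return {"microcode": sorted(microcode), "cpu_mhz": mhz}


def machine_snapshot(path="/proc/cpuinfo"):
    """Return microcode, clock and affinity info; fields are empty if unavailable."""
    try:
        with open(path) as fh:
            snap = parse_cpuinfo(fh.read())
    except OSError:
        snap = {"microcode": [], "cpu_mhz": []}
    # The set of CPUs this process may run on shows whether pinning is in effect.
    if hasattr(os, "sched_getaffinity"):
        snap["affinity"] = sorted(os.sched_getaffinity(0))
    else:
        snap["affinity"] = []
    return snap


def timed_runs(workload, repeats=5):
    """Time `workload` several times; return the timings and their relative spread."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    fastest = min(times)
    spread = (max(times) - fastest) / fastest if fastest > 0 else float("inf")
    return times, spread


if __name__ == "__main__":
    before = machine_snapshot()
    times, spread = timed_runs(lambda: sum(range(1_000_000)))
    after = machine_snapshot()
    print("relative spread:", spread)
    print("microcode:", before["microcode"], "allowed CPUs:", before["affinity"])
    print("clock before/after (MHz):", before["cpu_mhz"][:4], after["cpu_mhz"][:4])
```

A large relative spread together with differing clock readings, or an affinity list that spans the whole node, points towards the frequency and migration rows of the table. A changed microcode string between two measurement campaigns points towards the microcode row.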