The closed loop of performance tuning

Before starting: ensuring correctness

Performance tuning requires changing the software in some way, and this might inadvertently break it.

Discussion

What do you think is the most practical way to guarantee that performance optimization work does not break the code?

How often would you check?
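One possible answer, sketched here under stated assumptions, is an automated regression check that compares the optimized code against a trusted reference on fixed inputs. The two `sum_of_squares` functions, the tolerance, and the input sizes are hypothetical placeholders, not part of the original material; the point is only that the check is quick enough to run after every change.

```python
import math
import random


def reference_sum_of_squares(values):
    """Straightforward, trusted implementation (hypothetical example kernel)."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def optimized_sum_of_squares(values):
    """Candidate rewritten version whose result must match the reference."""
    return math.fsum(v * v for v in values)


def check_against_reference(n_cases=20, size=1000, rel_tol=1e-9, seed=0):
    """Compare both versions on seeded random inputs.

    Results are compared with a tolerance, because reordering floating-point
    operations can legitimately change the last digits.
    """
    rng = random.Random(seed)  # fixed seed: the check itself is reproducible
    for _ in range(n_cases):
        values = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        expected = reference_sum_of_squares(values)
        actual = optimized_sum_of_squares(values)
        if not math.isclose(actual, expected, rel_tol=rel_tol):
            return False
    return True


if __name__ == "__main__":
    print("optimized version matches reference:", check_against_reference())
```

Choosing the tolerance is a domain decision: it should be loose enough to accept harmless rounding differences and tight enough to catch real bugs.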

Performance tuning workflow

Once a way to ascertain the correctness of the code is available (preferably quick and automated), performance optimization typically proceeds in an iterative fashion.

“Scientific” workflow:

  1. Make a falsifiable hypothesis (why is my code slow?)

  2. Get data and analyse it, plan code changes (design experiment)

  3. Action: implement changes (perform experiment)

Another, more concrete, view of the same approach:

Phases of the Performance Tuning loop

  1. Measurement

     Possible problems: What to measure? And how?

     Problem mitigation (discuss): domain knowledge (which use cases are relevant?), computer architecture knowledge, performance tool knowledge.

  2. Analysis

     Possible problems: Performance analysis tools might produce a lot of data. What is relevant?

     Problem mitigation (discuss): computer architecture knowledge (performance tool knowledge).

  3. Generation of alternatives

     Possible problems: How to make hypotheses?

     Problem mitigation (discuss): computer architecture knowledge.

  4. Implementation

     Possible problems: Is a code change worth it (complexity vs. performance)? Redundancy of optimizations.

     Problem mitigation (discuss): version control, proper architecture, domain knowledge (definition of relevant use cases).
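As an illustration of the measurement and analysis phases, the following minimal sketch uses Python's standard `cProfile` and `pstats` modules and keeps only the few most expensive entries, which is one way to cope with the volume of data a profiler produces. The functions `slow_kernel`, `cheap_setup` and `workload` are hypothetical stand-ins for a real application.

```python
import cProfile
import io
import pstats


def slow_kernel(n):
    """Deliberately expensive function (hypothetical hotspot)."""
    return sum(i * i for i in range(n))


def cheap_setup(n):
    """Cheap function that should not dominate the profile."""
    return list(range(n))


def workload():
    """Stand-in for a relevant use case of the application."""
    cheap_setup(1_000)
    return slow_kernel(200_000)


def profile_top(func, limit=5):
    """Profile func and report only the `limit` costliest entries.

    Entries are sorted by cumulative time, so callers of expensive
    functions appear near the top together with the hotspots themselves.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_top(workload))
```

For compiled HPC codes a dedicated profiler would take the place of `cProfile`; the habit of filtering the output down to the dominant entries before forming a hypothesis carries over.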

It is also important:

  • to keep track of the progress made during the iterations of the performance optimization loop, and of the information (or lack thereof) that influenced each decision. This can be done, for example, with a logbook.

  • to strive for performance reproducibility, keeping all sources of performance irreproducibility under control
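Both points can be supported by a small script. The sketch below, which is only one possible arrangement, repeats each timing several times so that run-to-run variation is visible, and appends a record to a plain-text logbook. The file format (one JSON object per line) and all field names are assumptions made for illustration.

```python
import json
import platform
import statistics
import time


def measure(func, repeats=5):
    """Time func several times; return per-run wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def make_log_entry(label, hypothesis, durations):
    """Build one logbook record: what was tried, why, and what was measured."""
    return {
        "label": label,
        "hypothesis": hypothesis,
        "python": platform.python_version(),
        "machine": platform.machine(),
        "repeats": len(durations),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        # A large spread signals a reproducibility problem worth investigating.
        "spread_s": max(durations) - min(durations),
    }


def append_to_logbook(path, entry):
    """Append the record as one JSON line, so the file stays easy to diff."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
```

A typical use would be `append_to_logbook("logbook.jsonl", make_log_entry("baseline", "inner loop dominates", measure(my_function)))`, where `my_function` and the file name are placeholders. Keeping the logbook under version control next to the code ties each measurement to the change that produced it.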

Porting to accelerators, parallelization and performance engineering

For software to run efficiently on modern HPC clusters, it typically needs to be capable of:

  • using multiple cores on the same host (shared memory)

  • using multiple nodes (distributed memory)

  • using accelerators

Some care must be taken when porting code to take advantage of HPC hardware:

  • Node-level performance should be understood and optimized before attempting parallelization. Software that performs poorly at the node level often appears to scale well on multiple nodes, because its slow computation masks the communication overhead.

  • Reaching good performance on GPUs might require changes not only in the algorithms but also in the way data is stored in memory (the memory layout). In general, performance engineering is an integral part of porting software to another hardware architecture.
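The first point above can be made concrete with a toy model. It assumes, purely for illustration, that the compute work divides perfectly across nodes and that communication adds a fixed cost whenever more than one node is used; the numbers are invented, not measurements.

```python
def run_time(compute_time, comm_time, nodes):
    """Toy model: compute work divides perfectly, communication cost is fixed."""
    if nodes == 1:
        return compute_time  # no inter-node communication on a single node
    return compute_time / nodes + comm_time


def parallel_efficiency(compute_time, comm_time, nodes):
    """Speedup relative to one node, divided by the number of nodes."""
    single = run_time(compute_time, comm_time, 1)
    multi = run_time(compute_time, comm_time, nodes)
    return (single / multi) / nodes


if __name__ == "__main__":
    comm = 1.0   # assumed fixed communication cost (arbitrary time units)
    nodes = 16
    for label, compute in (("unoptimized", 100.0), ("optimized", 10.0)):
        print(
            label,
            "time on", nodes, "nodes:", run_time(compute, comm, nodes),
            "efficiency:", round(parallel_efficiency(compute, comm, nodes), 3),
        )
```

In this model the unoptimized code reaches a parallel efficiency of roughly 86% on 16 nodes and the optimized one only about 38%, yet the optimized code is still more than four times faster in absolute terms (1.625 versus 7.25 time units). Good scaling numbers alone therefore say little about how well the hardware is being used.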

Moreover, the available hardware typically changes every few years, while the useful life of scientific software is typically longer. Maintenance costs can be reduced by using performance portability frameworks.