Elements of Computer Architecture and their effects
Compute balance: Raw compute power is cheap, memory bandwidth is not. This is why most HPC workloads are memory bound.
Tools for system investigation
To understand what are the limits of the system we are working on, we can use some predefined benchmarks to assess the performance of the system.
likwid-bench: suite of tunable micro benchmarks for CPU and main memory
fio for IO benchmarks and tests
OSU Micro Benchmarks: suite of micro benchmarks for different types of MPI communication
Trade-offs
Demo: Hardware exploration with LIKWID
What is the structure of the CPU?
likwid-topology
can help us.
We can use likwid-bench to explore the limits of our hardware.
For an exploration, see this notebook
Memory: Size vs Speed and latency.
System |
Bandwidth |
Latency |
Size |
Notes |
|---|---|---|---|---|
Registers |
* |
< 1cy |
1KB |
per core |
L1 cache |
<100GB/s |
~1ns |
32KB |
per core |
L2 cache |
<100GB/s |
~4ns |
512KB |
per core |
L3 cache |
1.25 TB/s |
~16ns |
50MB |
per package, shared |
HBM |
2-8TB/s |
64GB |
per package |
|
Main Memory |
50-400GB/s |
~100ns(*) |
200GB |
per node |
Network (IB) |
<100GB/s |
- |
||
Local SDD |
~1GB/s |
~1TB |
||
Global Filesystem |
~50GB/s |
~PBs |
||
Tape |
~1min |
With likwid-topology and likwid-bench we can get
more precise values or better estimates.
Challenge: What if our data structures do not fit in memory/storage system X?
Memory granularity:
Registers (there’s only a handful of them unfortunately)
Cache line (64B on most CPUs, 64B-128B on GPUs)
Memory Pages: >4MB
Parallelism in the core: Vectorization and superscalarity
CPUs can do more things at the same time in two main ways:
Vectorization (SIMD, Single Instruction Multiple Data):
Superscalarity and pipelining
Vector operations
Note: Vector loads/store have typically higher bandwidth,
especially when fetching data from L1
(verify with likwid-bench).
Superscalarity
Caches
Data is moved between cache levels and between caches and the RAM only in “chunks” of 64Bytes (the size of a cache line).
This means that if we use only a single value out of a 64B cache line, we will potentially trash a lot of data.
Cache is typically not fully associative: not all 64B chunks of the main memory can be “mirrored” in all cache lines, but typically each of them can only fit in a specific “cache line group”.
Some access patterns can lead to poor cache utilization due to cache associativity.
Cache levels can also be inclusive, esclusive or neither (cache inclusion policy).
False sharing
What is the best way to split embarrassingly parallel work on an array by N threads?
Thread 0 works on element 0, Thread 1 works on element 1 and in general Thread
iworks on elementsk*nthreads+i;Array is split in
nthreadscontiguous chunks, every thread works on a chunk.
Solution
When working with multiple threads, be aware that different threads working on different cores accessing the same data might “fight” over a single cache line, as the hardware is forced to give a consistent view of the data to both threads. In that situation, called false sharing, the workload becomes latency-bound. For this reason, typically the best approach is 2.
Note that this is exactly the opposite to what is done when writing GPU kernels, where each thread in a thread block works typically on data that is close to the data on which the other members of the thread bloc operate: we are operating at another granularity level.
Main memory
Rows and columns: memory is usually accessed in rows and columns. If different data is read from the same row repeatedly, then access is typically noticeably faster. This is another reason why sequential access can be much faster than random access.
Non-Uniform-Memory-Access
On modern and powerful compute nodes it is very common nowadays the situation where not all compute units can access the main memory in an uniform way.
Situation |
Bandwidth |
Latency |
|---|---|---|
Same domain |
165GB/s |
90 ns |
Different domain |
55GB/s |
140 ns |
Simple performance modelling: the roofline model
Recommendations
Ensuring Data reuse is paramount:
use as much data as possible out of every cache line before it is evicted (split “hot” and “cold” data between different structures/arrays)
reuse as much as possible data that has been read or written shortly before